2025-05-02-16-25
Urban Air Mobility as a System of Systems: An LLM-Enhanced Holonic Approach
Abstract
arXiv:2505.00368v1 Announce Type: new Abstract: Urban Air Mobility (UAM) is an emerging System of Systems (SoS) that faces challenges in system architecture, planning, task management, and execution. Traditional architectural approaches struggle with scalability, adaptability, and seamless resource integration within dynamic and complex environments. This paper presents an intelligent holonic architecture that incorporates Large Language Models (LLMs) to manage the complexities of UAM. Holons function semi-autonomously, allowing for real-time coordination among air taxis, ground transport, and vertiports. LLMs process natural language inputs, generate adaptive plans, and manage disruptions such as weather changes or airspace closures. Through a case study of multimodal transportation with electric scooters and air taxis, we demonstrate how this architecture enables dynamic resource allocation, real-time replanning, and autonomous adaptation without centralized control, creating more resilient and efficient urban transportation networks. By advancing decentralized control and AI-driven adaptability, this work lays the groundwork for resilient, human-centric UAM ecosystems, with future efforts targeting hybrid AI integration and real-world validation.
摘要
城市空中交通(UAM)作为一种新兴的系统之系统(SoS),在系统架构、规划、任务管理与执行方面面临诸多挑战。传统架构方法难以在动态复杂环境中实现可扩展性、适应性与无缝资源整合。本文提出一种融合大语言模型(LLM)的智能整体架构,以应对UAM的复杂性。整体单元以半自主方式运行,实现空中出租车、地面交通与垂直起降场间的实时协同。LLM通过处理自然语言输入生成自适应计划,并管理天气变化、空域关闭等突发状况。基于电动滑板车与空中出租车的多式联运案例研究,我们验证了该架构如何在不依赖集中控制的情况下实现动态资源分配、实时重规划与自主适应,从而构建更具韧性与效率的城市交通网络。通过推进去中心化控制与人工智能驱动的适应性,本研究为构建以人为中心的韧性UAM生态系统奠定基础,未来工作将聚焦混合人工智能集成与现实场景验证。
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Abstract
arXiv:2505.00212v1 Announce Type: new Abstract: Failure attribution in LLM multi-agent systems, that is, identifying the agent and step responsible for task failures, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
摘要
大语言模型多智能体系统中的故障归因——即识别导致任务失败的智能体及责任步骤——为系统调试提供了关键线索,但该领域仍研究不足且依赖人工。本文提出并定义了一个新的研究方向:大语言模型多智能体系统的自动化故障归因。为支持该研究,我们发布了Who&When数据集,包含127个大语言模型多智能体系统的详细故障日志,并标注了故障关联的特定智能体及关键错误步骤。基于Who&When数据集,我们开发并评估了三种自动化故障归因方法,总结了各自的优缺点。最佳方法在识别责任智能体时达到53.5%准确率,但在定位故障步骤时仅14.2%,部分方法表现甚至低于随机水平。即使如OpenAI o1和DeepSeek R1等最先进的推理模型也未能达到实用要求。这些结果揭示了该任务的复杂性,凸显了进一步研究的必要性。代码与数据集详见https://github.com/mingyin1/Agents_Failure_Attribution
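The two accuracy figures quoted in the abstract above (agent-level and step-level) can be illustrated with a minimal sketch. The `Attribution` record and the `attribution_accuracy` function are hypothetical names introduced here, not part of the Who&When code base, and the sketch assumes that a step prediction counts as correct only on an exact index match, scored independently of the agent prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: who is blamed and at which step of the log."""
    agent: str  # name of the agent held responsible for the failure
    step: int   # index of the decisive error step in the failure log


def attribution_accuracy(predictions, labels):
    """Return (agent_accuracy, step_accuracy) over paired predictions and labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty lists")
    pairs = list(zip(predictions, labels))
    agent_hits = sum(p.agent == g.agent for p, g in pairs)
    step_hits = sum(p.step == g.step for p, g in pairs)  # exact match (assumption)
    return agent_hits / len(pairs), step_hits / len(pairs)
```

Under this scoring, a method can name the right agent in every log and still score poorly on steps, which is the gap the abstract reports.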
UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces
Abstract
arXiv:2505.00472v1 Announce Type: new Abstract: Agentic AI, with its autonomous and proactive decision-making, has transformed smart environments. By integrating Generative AI (GenAI) and multi-agent systems, modern AI frameworks can dynamically adapt to user preferences, optimize data management, and improve resource allocation. This paper introduces UserCentrix, an agentic memory-augmented AI framework designed to enhance smart spaces through dynamic, context-aware decision-making. This framework integrates personalized Large Language Model (LLM) agents that leverage user preferences and LLM memory management to deliver proactive and adaptive assistance. Furthermore, it incorporates a hybrid hierarchical control system, balancing centralized and distributed processing to optimize real-time responsiveness while maintaining global situational awareness. UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies. Our key contributions include (i) a self-organizing framework with proactive scaling based on task urgency, (ii) a Value of Information (VoI)-driven decision-making process, (iii) a meta-reasoning personal LLM agent, and (iv) an intelligent multi-agent coordination system for seamless environment adaptation. Experimental results across various models confirm the effectiveness of our approach in enhancing response accuracy, system efficiency, and computational resource management in real-world applications.
摘要
具备自主决策能力的代理型人工智能正在改变智能环境。通过整合生成式人工智能(GenAI)与多智能体系统,现代AI框架能够动态适应用户偏好、优化数据管理并改进资源分配。本文提出UserCentrix框架——一种基于记忆增强的代理型AI架构,旨在通过动态情境感知决策提升智能空间性能。该框架集成个性化大语言模型(LLM)代理,利用用户偏好与LLM记忆管理机制,提供主动式自适应辅助。此外,系统采用混合分层控制架构,通过集中式与分布式处理的平衡优化实时响应能力,同时保持全局态势感知。通过嵌入记忆增强推理、协同代理协商与自适应编排策略,UserCentrix实现了资源高效的AI交互。我们的核心贡献包括:(i)基于任务紧急性的自组织主动扩展框架;(ii)信息价值(VoI)驱动的决策流程;(iii)具备元推理能力的个人化LLM代理;(iv)支持无缝环境适应的智能多代理协调系统。多模型实验结果表明,该方法在提升实际应用中的响应精度、系统效率及计算资源管理方面具有显著效果。
RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset
Abstract
arXiv:2505.00204v1 Announce Type: new Abstract: As AI systems become embedded in real-world applications, ensuring they meet ethical standards is crucial. While existing AI ethics frameworks emphasize fairness, transparency, and accountability, they often lack actionable evaluation methods. This paper introduces a systematic approach using the Responsible AI Labs (RAIL) framework, which includes eight measurable dimensions to assess the normative behavior of large language models (LLMs). We apply this framework to Anthropic's "Values in the Wild" dataset, containing over 308,000 anonymized conversations with Claude and more than 3,000 annotated value expressions. Our study maps these values to RAIL dimensions, computes synthetic scores, and provides insights into the ethical behavior of LLMs in real-world use.
摘要
随着人工智能系统日益嵌入现实应用场景,确保其符合伦理标准变得至关重要。现有AI伦理框架虽强调公平性、透明度和问责制,但往往缺乏可操作的评估方法。本文采用'负责任人工智能实验室'(RAIL)框架提出系统性解决方案,该框架包含八个可量化维度用于评估大语言模型(LLM)的规范性行为。我们将该框架应用于Anthropic公司'野生环境中的价值观'数据集(包含超过308,000条与Claude模型的匿名对话及3,000余条标注的价值表述),通过将这些价值映射至RAIL维度、计算综合评分,揭示了LLM在真实应用场景中的伦理行为特征。
Position Paper: Towards Open Complex Human-AI Agents Collaboration System for Problem-Solving and Knowledge Management
Abstract
arXiv:2505.00018v1 Announce Type: new Abstract: This position paper critically surveys a broad spectrum of recent empirical developments on human-AI agents collaboration, highlighting both their technical achievements and persistent gaps. We observe a lack of a unifying theoretical framework that can coherently integrate these varied studies, especially when tackling open-ended, complex tasks. To address this, we propose a novel conceptual architecture: one that systematically interlinks the technical details of multi-agent coordination, knowledge management, cybernetic feedback loops, and higher-level control mechanisms. By mapping existing contributions, from symbolic AI techniques and connectionist LLM-based agents to hybrid organizational practices, onto this proposed framework (Hierarchical Exploration-Exploitation Net), our approach facilitates revision of legacy methods and inspires new work that fuses qualitative and quantitative paradigms. The paper's structure allows it to be read from any section, serving equally as a critical review of technical implementations and as a forward-looking reference for designing or extending human-AI symbioses. Together, these insights offer a stepping stone toward deeper co-evolution of human cognition and AI capability.
摘要
本立场文件批判性地审视了人机智能体协作领域近期广泛的经验性进展,既凸显了技术成就,也揭示了持续存在的空白。我们注意到当前缺乏一个能够统合这些多样化研究的理论框架,尤其在处理开放式复杂任务时表现尤为明显。为此,我们提出了一种新颖的概念架构:该架构系统性地将多智能体协调、知识管理、控制论反馈循环与高层控制机制等技术细节相互关联。通过将现有贡献(从符号AI技术、基于连接主义大语言模型的智能体到混合组织实践)映射到所提出的框架(分层探索-开发网络)上,我们的方法既促进了对传统方法的修正,也启发了融合定性与定量范式的新研究。本文的结构设计允许从任意章节开始阅读,既可视为对技术实现的批判性综述,也可作为设计或扩展人机共生系统的前瞻性参考。这些见解共同为人类认知与AI能力的深度协同进化奠定了基石。
Can LLMs Help Improve Analogical Reasoning For Strategic Decisions? Experimental Evidence from Humans and GPT-4
Abstract
arXiv:2505.00603v1 Announce Type: new Abstract: This study investigates whether large language models, specifically GPT-4, can match human capabilities in analogical reasoning within strategic decision-making contexts. Using a novel experimental design involving source-to-target matching, we find that GPT-4 achieves high recall by retrieving all plausible analogies but suffers from low precision, frequently applying incorrect analogies based on superficial similarities. In contrast, human participants exhibit high precision but low recall, selecting fewer analogies yet with stronger causal alignment. These findings advance theory by identifying matching, the evaluative phase of analogical reasoning, as a distinct step that requires accurate causal mapping beyond simple retrieval. While current LLMs are proficient in generating candidate analogies, humans maintain a comparative advantage in recognizing deep structural similarities across domains. Error analysis reveals that AI errors arise from surface-level matching, whereas human errors stem from misinterpretations of causal structure. Taken together, the results suggest a productive division of labor in AI-assisted organizational decision making where LLMs may serve as broad analogy generators, while humans act as critical evaluators, applying the most contextually appropriate analogies to strategic problems.
摘要
本研究探讨了大型语言模型(特别是GPT4)在战略决策情境下的类比推理能力能否与人类相匹敌。通过采用源目标匹配的新型实验设计,我们发现GPT4通过检索所有可能类比实现了高召回率,但精确度较低,经常基于表面相似性错误应用类比。相比之下,人类参与者表现出高精确度但低召回率,他们选择的类比数量较少但因果关联性更强。这些发现通过将类比推理的评估阶段——匹配识别为需要超越简单检索的准确因果映射的独立步骤,推动了理论发展。当前大型语言模型虽然擅长生成候选类比,但人类在识别跨领域深层结构相似性方面仍具比较优势。错误分析表明,AI错误源于表层匹配,而人类错误则来自对因果结构的误解。综合来看,研究结果揭示了AI辅助组织决策中一种有效的分工模式:大型语言模型可作为广泛的类比生成器,而人类则充当关键评估者,将最符合情境的类比应用于战略问题。
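The high-recall versus high-precision contrast in the abstract above can be made concrete with a small worked example. The function name and the analogy labels below are made up for illustration and do not come from the study's data.

```python
def precision_recall(selected, correct):
    """Precision and recall of a set of selected analogies against the correct set."""
    selected, correct = set(selected), set(correct)
    true_positives = len(selected & correct)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical case: two of four candidate analogies are causally appropriate.
correct = {"A", "B"}
broad = precision_recall({"A", "B", "C", "D"}, correct)  # picks everything: (0.5, 1.0)
narrow = precision_recall({"A"}, correct)                # picks one safe match: (1.0, 0.5)
```

The broad selector mirrors the behavior the abstract attributes to GPT-4, and the narrow selector mirrors the human participants.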
Distributed Retrieval-Augmented Generation
Abstract
arXiv:2505.00443v1 Announce Type: new Abstract: As large language models (LLMs) become increasingly adopted on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a solution to address factual deficiencies and hallucinations by integrating external knowledge. However, centralized RAG architectures face significant challenges in data privacy and scalability. For instance, smart healthcare services often rely on collecting sensitive patient data and building a centralized knowledge base to provide better diagnosis and treatment advice, while privacy concerns significantly impede this process. Besides, maintaining a comprehensive and continuously updated knowledge base is costly, particularly in response to regional epidemics and rapidly mutating viruses. To address these challenges, this paper introduces Distributed Retrieval-Augmented Generation (DRAG), a novel framework that improves data privacy by eliminating the need for a centralized knowledge base and restoring data control to owners. DRAG incorporates a Topic-Aware Random Walk (TARW) algorithm that leverages LLMs to extract query topics and facilitate targeted peer discovery within a peer-to-peer network, enabling efficient knowledge retrieval in decentralized environments. Extensive experiments across three diverse datasets and LLMs demonstrate that DRAG with TARW achieves near-centralized RAG performance while using half as many messages as flooding. The code is available at https://github.com/xuchenhao001/DRAG.
摘要
随着大语言模型(LLMs)在边缘设备上的应用日益广泛,检索增强生成(RAG)技术通过整合外部知识来解决事实性缺陷和幻觉问题的重要性逐渐凸显。然而,集中式RAG架构在数据隐私和可扩展性方面面临重大挑战。例如,智能医疗服务通常依赖收集敏感患者数据并构建集中式知识库以提供更好的诊疗建议,而隐私问题严重阻碍了这一进程。此外,维护一个全面且持续更新的知识库成本高昂,尤其是在应对区域性流行病和快速变异的病毒时。为应对这些挑战,本文提出分布式检索增强生成(DRAG)框架,该框架通过消除集中式知识库需求并将数据控制权归还所有者,显著提升了数据隐私性。DRAG引入了一种主题感知随机游走(TARW)算法,利用LLMs提取查询主题并在点对点网络中实现精准节点发现,从而在去中心化环境中实现高效知识检索。基于三个不同数据集和多种LLMs的广泛实验表明,采用TARW的DRAG仅需洪泛法一半的消息量即可达到接近集中式RAG的性能。代码已开源:https://github.com/xuchenhao001/DRAG。
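The idea of topic-guided peer discovery described in the abstract above can be sketched as a biased random walk. This is a simplified illustration and not the authors' TARW algorithm: the function name, the `graph` and `topics` dictionaries, and the assumption that each peer knows which topics its direct neighbors advertise are all hypothetical. In the paper the query topic is extracted by an LLM; here it is simply passed in as a string.

```python
import random


def topic_aware_walk(graph, topics, start, query_topic, max_hops=10, seed=0):
    """Walk the peer graph from `start`, preferring neighbors that advertise `query_topic`.

    Returns (peer_holding_topic_or_None, messages_sent).
    """
    rng = random.Random(seed)
    current, messages = start, 0
    visited = {start}
    for _ in range(max_hops):
        if query_topic in topics[current]:
            return current, messages
        neighbors = graph[current]
        if not neighbors:
            break
        # Prefer topic matches, then unexplored peers, then any neighbor.
        matching = [n for n in neighbors if query_topic in topics[n]]
        unvisited = [n for n in neighbors if n not in visited]
        current = rng.choice(matching or unvisited or neighbors)
        visited.add(current)
        messages += 1  # one forwarded query per hop
    if query_topic in topics[current]:
        return current, messages
    return None, messages
```

Each hop forwards the query to a single neighbor, whereas flooding forwards it to every neighbor, which is why a guided walk can send far fewer messages on the same network.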
Combining LLMs with Logic-Based Framework to Explain MCTS
Abstract
arXiv:2505.00610v1 Announce Type: new Abstract: In response to the lack of trust in Artificial Intelligence (AI) for sequential planning, we present a Computational Tree Logic-guided large language model (LLM)-based natural language explanation framework for the Monte Carlo Tree Search (MCTS) algorithm. MCTS is often considered challenging to interpret due to the complexity of its search trees, but our framework is flexible enough to handle a wide range of free-form post-hoc queries and knowledge-based inquiries centered around MCTS and the Markov Decision Process (MDP) of the application domain. By transforming user queries into logic and variable statements, our framework ensures that the evidence obtained from the search tree remains factually consistent with the underlying environmental dynamics and any constraints in the actual stochastic control process. We evaluate the framework rigorously through quantitative assessments, where it demonstrates strong performance in terms of accuracy and factual consistency.
摘要
针对人工智能(AI)在序列规划中可信度不足的问题,我们设计了一种基于计算树逻辑引导的大语言模型(LLM)自然语言解释框架,该框架专为蒙特卡洛树搜索(MCTS)算法而设计。由于搜索树的复杂性,MCTS通常被认为难以解释,但我们的框架具有足够的灵活性,能够处理围绕MCTS和应用领域马尔可夫决策过程(MDP)的各种自由形式事后查询与基于知识的询问。通过将用户查询转化为逻辑和变量语句,我们的框架确保从搜索树获取的证据始终与底层环境动态及实际随机控制过程中的任何约束保持事实一致性。通过定量评估对该框架进行严格验证,结果表明其在准确性和事实一致性方面均表现出色。
Open-Source LLM-Driven Federated Transformer for Predictive IoV Management
Abstract
arXiv:2505.00651v1 Announce Type: new Abstract: The proliferation of connected vehicles within the Internet of Vehicles (IoV) ecosystem presents critical challenges in ensuring scalable, real-time, and privacy-preserving traffic management. Existing centralized IoV solutions often suffer from high latency, limited scalability, and reliance on proprietary Artificial Intelligence (AI) models, creating significant barriers to widespread deployment, particularly in dynamic and privacy-sensitive environments. Meanwhile, integrating Large Language Models (LLMs) in vehicular systems remains underexplored, especially concerning prompt optimization and effective utilization in federated contexts. To address these challenges, we propose the Federated Prompt-Optimized Traffic Transformer (FPoTT), a novel framework that leverages open-source LLMs for predictive IoV management. FPoTT introduces a dynamic prompt optimization mechanism that iteratively refines textual prompts to enhance trajectory prediction. The architecture employs a dual-layer federated learning paradigm, combining lightweight edge models for real-time inference with cloud-based LLMs to retain global intelligence. A Transformer-driven synthetic data generator is incorporated to augment training with diverse, high-fidelity traffic scenarios in the Next Generation Simulation (NGSIM) format. Extensive evaluations demonstrate that FPoTT, utilizing EleutherAI Pythia-1B, achieves 99.86% prediction accuracy on real-world data while maintaining high performance on synthetic datasets. These results underscore the potential of open-source LLMs in enabling secure, adaptive, and scalable IoV management, offering a promising alternative to proprietary solutions in smart mobility ecosystems.
摘要
车联网(IoV)生态系统中互联车辆的激增对实现可扩展、实时且隐私保护的交通管理提出了关键挑战。现有集中式车联网解决方案普遍存在高延迟、可扩展性有限及依赖专有人工智能(AI)模型等问题,这为大规模部署(尤其在动态且隐私敏感的环境中)设置了显著障碍。与此同时,大型语言模型(LLMs)在车辆系统中的集成应用仍待深入探索,特别是在提示优化与联邦场景下的有效利用方面。为应对这些挑战,我们提出联邦提示优化交通变换器(FPoTT)——一种利用开源LLMs进行预测性车联网管理的新型框架。FPoTT引入动态提示优化机制,通过迭代优化文本提示以提升轨迹预测性能。该架构采用双层联邦学习范式,将轻量级边缘模型(用于实时推理)与基于云端的LLMs(用于保持全局智能)相结合,并集成基于Transformer的合成数据生成器,以下一代仿真(NGSIM)格式生成多样化高保真交通场景来增强训练。大量实验表明,采用EleutherAI Pythia-1B的FPoTT在真实数据上实现了99.86%的预测准确率,同时在合成数据集上保持优异性能。这些结果印证了开源LLMs在实现安全、自适应、可扩展车联网管理方面的潜力,为智能出行生态系统中的专有解决方案提供了有前景的替代选择。
LangVAE and LangSpace: Building and Probing for Language Model VAEs
Abstract
arXiv:2505.00004v1 Announce Type: cross Abstract: We present LangVAE, a novel framework for modular construction of variational autoencoders (VAEs) on top of pre-trained large language models (LLMs). Such language model VAEs can encode the knowledge of their pre-trained components into more compact and semantically disentangled representations. The representations obtained in this way can be analysed with the LangVAE companion framework: LangSpace, which implements a collection of probing methods, such as vector traversal and interpolation, disentanglement measures, and cluster visualisations. LangVAE and LangSpace offer a flexible, efficient and scalable way of building and analysing textual representations, with simple integration for models available on the HuggingFace Hub. Additionally, we conducted a set of experiments with different encoder and decoder combinations, as well as annotated inputs, revealing a wide range of interactions across architectural families and sizes w.r.t. generalisation and disentanglement. Our findings demonstrate a promising framework for systematising the experimentation and understanding of textual representations.
摘要
我们提出LangVAE——一种基于预训练大语言模型(LLM)实现变分自编码器(VAE)模块化构建的新框架。此类语言模型VAE能够将其预训练组件的知识编码为更紧凑且语义解耦的表示。通过该框架获得的表征可通过配套分析工具LangSpace进行研究,该工具集成了向量遍历与插值、解耦度测量及聚类可视化等探测方法。LangVAE与LangSpace为构建和分析文本表征提供了灵活、高效且可扩展的解决方案,并能便捷集成HuggingFace Hub上的模型。我们通过不同编码器-解码器组合及标注输入的实验,揭示了模型架构家族与规模在泛化能力和解耦特性方面存在的广泛交互关系。研究结果表明,该框架为系统化实验与文本表征理解提供了可行方案。
Toward a digital twin of U.S. Congress
Abstract
arXiv:2505.00006v1 Announce Type: cross Abstract: In this paper we provide evidence that a virtual model of U.S. congresspersons based on a collection of language models satisfies the definition of a digital twin. In particular, we introduce and provide high-level descriptions of a daily-updated dataset that contains every Tweet from every U.S. congressperson during their respective terms. We demonstrate that a modern language model equipped with congressperson-specific subsets of this data is capable of producing Tweets that are largely indistinguishable from actual Tweets posted by their physical counterparts. We illustrate how generated Tweets can be used to predict roll-call vote behaviors and to quantify the likelihood of congresspersons crossing party lines, thereby assisting stakeholders in allocating resources and potentially impacting real-world legislative dynamics. We conclude with a discussion of the limitations and important extensions of our analysis.
摘要
本文通过实证研究表明,基于语言模型集合构建的美国国会议员虚拟模型符合数字孪生的定义。我们重点介绍并概述了一个每日更新的数据集,该数据集收录了每位美国国会议员在任期内发布的所有推文。研究证明,利用针对特定议员定制的数据子集,现代语言模型能够生成与其真实推文高度相似的虚拟推文。我们进一步阐明,这些生成推文可用于预测议员的唱名表决行为,并量化其跨党派投票的可能性,从而帮助利益相关方优化资源配置,并可能对现实立法动态产生影响。最后,我们探讨了本研究的局限性及未来重要的拓展方向。
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Abstract
arXiv:2505.00010v1 Announce Type: cross Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.
摘要
大型语言模型(LLMs)中的越狱行为会使用户绕过伦理防护措施,威胁其在教育等敏感领域的安全使用。本研究重点检测临床教育平台2-Sigma中的越狱行为,该平台利用LLMs模拟患者互动。我们使用四个与越狱行为高度相关的语言变量,对158组对话中的2300余条提示进行了标注。提取的特征被用于训练多种预测模型,包括决策树、基于模糊逻辑的分类器、提升方法以及逻辑回归。结果表明,基于特征的预测模型始终优于提示工程,其中模糊决策树取得了最佳整体性能。我们的研究证明,基于语言特征的模型是越狱检测中有效且可解释的替代方案。建议未来工作探索混合框架,将基于提示的灵活性与基于规则的鲁棒性相结合,以实现教育类LLMs中基于频谱的实时越狱监测。
Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning
Abstract
arXiv:2505.00016v1 Announce Type: cross Abstract: This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data, moving beyond the traditional focus on query generation. We propose a two-stage framework that leverages SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that connects SQL execution accuracy to generalizable reasoning by encouraging steps that extend beyond task-specific syntax and transfer across datasets. Empirically, our approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating enhanced generalization and interpretability. Specifically, the distilled-quantized LLaMA model achieved a 20% increase in accuracy when trained on Text-to-SQL tasks, while Qwen achieved a 5% increase. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.
摘要
本研究将文本到SQL任务重新定义为培养大型语言模型(LLMs)进行表格数据推理与操作的学习路径,突破了传统查询生成的局限。我们提出一个两阶段框架,利用SQL监督来开发可迁移的表格推理能力:首先,从真实SQL查询合成详细的思维链(CoT)轨迹,提供分步骤、子句级的监督,指导模型如何遍历、筛选和聚合表格字段;其次,引入组相对策略优化(GRPO)强化学习目标,通过鼓励超越任务特定语法且能跨数据集迁移的推理步骤,将SQL执行准确率与泛化推理能力相关联。实验表明,该方法不仅提升了标准文本到SQL基准的性能,更在BIRD和CRT-QA等推理密集型数据集上取得显著进步,展现出增强的泛化能力和可解释性。具体而言,经过文本到SQL任务训练的蒸馏量化LLaMA模型准确率提升20%,Qwen模型提升5%。这些结果表明,SQL不仅能作为目标形式化语言,更能成为学习结构化数据稳健可迁移推理的有效脚手架。
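The abstract above names Group Relative Policy Optimization (GRPO) without describing it. In GRPO as commonly formulated, several responses are sampled for the same prompt and each response's reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the reward values are invented rather than taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled SQL queries: 1.0 if execution matched, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.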
ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation
Abstract
arXiv:2505.00017v1 Announce Type: cross Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.
摘要
为实现基于大语言模型(LLMs)的精准全自动化细胞类型注释,我们开发了图结构特征标记数据库,用于检索与差异基因关联的实体以进行细胞重建。进一步设计了多任务工作流程以优化注释过程。与通用LLMs相比,本方法在11种组织类型中将人工评估分数最高提升0.21,语义相似度提高6.1%,同时更贴近人工注释的认知逻辑。
An Empirical Study on Prompt Compression for Large Language Models
Abstract
arXiv:2505.00019v1 Announce Type: cross Abstract: Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the LongBench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
摘要
提示工程使大语言模型(LLMs)能够执行多种任务。然而,冗长的提示会显著增加计算复杂性和经济成本。为解决这一问题,我们研究了六种针对LLMs的提示压缩方法,旨在缩短提示长度的同时保持模型响应质量。本文提出了涵盖生成性能、模型幻觉、多模态任务有效性、词汇省略分析等方面的综合分析。我们在13个数据集上评估了这些方法,包括新闻、科学文章、常识问答、数学问答、长上下文问答以及视觉问答数据集。实验表明,与短上下文相比,提示压缩对长上下文中LLM性能的影响更为显著。在Longbench评估中,适度压缩甚至能提升LLM性能。相关代码和数据可在https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression获取。
Beyond Public Access in LLM Pre-Training Data
Abstract
arXiv:2505.00020v1 Announce Type: cross Abstract: Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content (AUROC = 82%), compared to OpenAI's earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O'Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
摘要
通过使用合法获取的34本受版权保护的O'Reilly Media书籍数据集,我们应用DE-COP成员推理攻击方法,调查OpenAI的大型语言模型是否在未经许可的情况下使用了受版权保护的内容进行训练。我们的AUROC评分显示,与OpenAI早期模型GPT-3.5 Turbo相比,其最新且性能更强的模型GPT-4o对付费墙保护的O'Reilly书籍内容表现出较强的识别能力(AUROC = 82%)。相反,GPT-3.5 Turbo对公开可访问的O'Reilly书籍样本表现出相对更高的识别度。而作为一个小得多的模型,GPT-4o Mini在测试中对公开或非公开的O'Reilly Media内容均未显示出识别能力(AUROC ≈ 50%)。通过测试具有相同截止日期的多个模型,我们能够排除时间推移可能导致的语言变化对研究结果的干扰。这些结果凸显了企业迫切需要提高预训练数据源的透明度,以建立AI内容训练的正式授权框架。
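The AUROC figures in the abstract above (82% versus roughly 50%) are easier to read with the metric's definition in hand: AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It illustrates the metric only and is not the DE-COP procedure; the function name and the scores in the comments are hypothetical.

```python
def auroc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    if not member_scores or not non_member_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


perfect = auroc([0.9, 0.8], [0.1, 0.2])  # 1.0: every member outranks every non-member
chance = auroc([0.5], [0.5])             # 0.5: scores carry no information
```

A value near 0.5 therefore means the model's responses do not separate training members from non-members better than guessing, which is how the GPT-4o Mini result should be read.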
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Abstract
arXiv:2505.00022v1 Announce Type: cross Abstract: Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
摘要
扩大数据规模对大型语言模型(LLM)至关重要,但近期研究表明,提升数据质量能显著提高模型性能和训练效率。我们提出了一套德语数据集构建流程,结合启发式与基于模型的过滤技术,并辅以合成数据生成。利用该流程,我们创建了Aleph-Alpha-GermanWeb——一个大规模德语预训练数据集,其数据来源包括:(1)Common Crawl网络数据,(2)FineWeb2,以及(3)基于真实网络数据生成的合成数据。我们通过预训练一个10亿参数的Llama风格模型和一个80亿参数的无分词器层次自回归变换器(HAT)来评估数据集性能。在包括MMMLU在内的德语基准测试中,Aleph-Alpha-GermanWeb相比仅使用FineWeb2展现出显著性能优势。即使FineWeb2补充了维基百科等人工精选的高质量数据源,这种优势在80亿参数规模下依然存在。我们的研究进一步证明:基于模型的数据筛选和合成数据生成能显著提升LLM预训练数据集质量。
CORG: Generating Answers from Complex, Interrelated Contexts
Abstract
arXiv:2505.00023v1 Announce Type: cross Abstract: In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.
摘要
在现实世界的语料库中,知识经常在不同文档间重复出现,但由于命名模糊、信息过时或错误等原因常存在不一致性,导致上下文之间形成复杂的相互关系。先前研究表明,语言模型难以应对这种复杂性,通常只能孤立地处理单一因素。我们将这些关系归类为四种类型:干扰性、模糊性、反事实性和重复性。分析表明,现有方法无法同时有效处理所有这些相互关系。为此,我们提出了上下文组织器(CORG),该框架通过将多个上下文组织成独立处理的组别,使模型既能高效找到所有相关答案,又能确保消歧效果。CORG包含三个核心组件:图构造器、重排序器和聚合器。实验结果表明,CORG在性能与效率之间取得了良好平衡,不仅优于现有分组方法,其效果还可与计算量更大的单上下文处理方法相媲美。
Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning
Abstract
arXiv:2505.00024v1 Announce Type: cross Abstract: Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text generation tasks. Prior work typically enhances tool-use abilities by either applying supervised fine-tuning (SFT) to enforce tool-call correctness or distilling reasoning traces from stronger models for SFT. However, both approaches fall short, either omitting reasoning entirely or producing imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models using a similar training paradigm. Instead of restrictively supervising intermediate reasoning traces distilled from stronger models, Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision allows the model to autonomously internalize reasoning strategies, without the need for annotated reasoning trajectories. Experiments on the BFCL and API-Bank benchmarks show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results, outperforming GPT-4o on both evaluations.
摘要
为大型语言模型配备外部工具已成为扩展其文本生成功能之外能力的关键策略。现有研究通常通过两种方式增强工具使用能力:应用监督微调(SFT)确保工具调用的正确性,或从更强模型中蒸馏推理轨迹用于SFT。然而这两种方法均存在不足——前者完全省略推理过程,后者产生的模仿性推理限制了泛化能力。受DeepSeek-R1通过基于规则的强化学习成功激发推理的启发,我们采用类似训练范式开发了Nemotron-Research-Tool-N1系列工具调用语言模型。该模型摒弃了对强模型蒸馏中间推理轨迹的严格监督,转而采用仅评估工具调用结构有效性和功能正确性的二元奖励机制进行优化。这种轻量级监督使模型能自主内化推理策略,而无需标注推理轨迹。在BFCL和API-Bank基准测试中,基于Qwen-2.5-7B/14B-Instruct构建的Nemotron-Research-Tool-N1-7B和Nemotron-Research-Tool-N1-14B均取得最先进成果,在两项评估中超越GPT-4o。
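The binary reward described in the abstract above, which checks only the structural validity and functional correctness of a tool invocation, can be sketched as follows. The JSON layout with `name` and `arguments` keys and the function name `binary_tool_reward` are assumptions made for illustration; the abstract does not specify the actual output format or matching rule used for Nemotron-Research-Tool-N1.

```python
import json


def binary_tool_reward(model_output: str, reference_call: dict) -> float:
    """Return 1.0 only if the output parses as a tool call and matches the reference."""
    # Structural validity: must be a JSON object with a string name and dict arguments.
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(call, dict):
        return 0.0
    if not isinstance(call.get("name"), str) or not isinstance(call.get("arguments"), dict):
        return 0.0
    # Functional correctness: exact match on tool name and arguments (assumption).
    matches = (
        call["name"] == reference_call["name"]
        and call["arguments"] == reference_call["arguments"]
    )
    return 1.0 if matches else 0.0
```

Because nothing in the reward inspects the reasoning text, the model is free to reason however it likes as long as the final call is well-formed and correct.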
A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1
Abstract
arXiv:2505.00025v1 Announce Type: cross Abstract: In recent years, despite foundation models like DeepSeek-R1 and ChatGPT demonstrating significant capabilities in general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their application in actual medical scenarios. Addressing these challenges, this paper proposes an efficient lightweight medical vertical large language model architecture method, systematically solving the lightweight problem of medical large models from three dimensions: knowledge acquisition, model compression, and computational optimization. At the knowledge acquisition level, a knowledge transfer pipeline is designed from the fine-tuned DeepSeek-R1-Distill-70B teacher model to the DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) technology is adopted to precisely adjust key attention layers. At the model compression level, compression techniques including 4-bit weight quantization are implemented while preserving the core representation ability for medical reasoning. At the computational optimization level, inference optimization techniques such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is constructed to adapt to different types of medical problems. Experimental results on medical question-answering datasets show that the method proposed in this paper maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, providing an effective solution for the application of medical large models in resource-constrained environments such as edge computing devices.
摘要
近年来,尽管DeepSeek-R1和ChatGPT等基础模型在通用任务中展现出强大能力,但专业知识壁垒、计算资源需求和部署环境限制严重阻碍了其在真实医疗场景的应用。针对这些挑战,本文提出一种高效的轻量化医疗垂直领域大语言模型架构方法,从知识获取、模型压缩和计算优化三个维度系统性地解决医疗大模型的轻量化问题。在知识获取层面,设计了从微调后的DeepSeek-R1-Distill-70B教师模型到DeepSeek-R1-Distill-7B学生模型的知识迁移流程,并采用低秩自适应(LoRA)技术对关键注意力层进行精准调整。在模型压缩层面,实施了包含4比特权重量化在内的压缩技术,同时保持医疗推理的核心表征能力。在计算优化层面,集成了Flash Attention加速和连续批处理等推理优化技术,并构建了专业提示模板系统以适应不同类型医疗问题。在医疗问答数据集上的实验结果表明,本文提出的方法在保持专业准确性的同时,内存消耗降低64.7%,推理延迟减少12.4%,为医疗大模型在边缘计算设备等资源受限环境中的应用提供了有效解决方案。
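The LoRA step named in the abstract above can be illustrated with a minimal NumPy sketch. This is the standard low-rank update (a frozen weight plus a scaled product of two small matrices), not the paper's implementation; the class name `LoRALinear`, the layer size, the rank, and `alpha` are all hypothetical values chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative sketch)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen pre-trained weight
        self.a = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
        self.b = np.zeros((d_out, rank))                     # trainable up-projection, starts at zero
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + (alpha / r) * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_params(self) -> int:
        return self.a.size + self.b.size


rng = np.random.default_rng(1)
frozen = rng.standard_normal((64, 64))   # hypothetical attention projection
layer = LoRALinear(frozen, rank=4, alpha=8.0)
x = rng.standard_normal((2, 64))
out = layer(x)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `rank * (d_in + d_out)` parameters (512 here, against 4096 in the frozen weight) would be updated during fine-tuning.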
Theory of Mind in Large Language Models: Assessment and Enhancement
Abstract
arXiv:2505.00026v1 Announce Type: cross Abstract: Theory of Mind (ToM), the ability to infer and reason about others' mental states, is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs' ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs' ToM capabilities.
摘要
心理理论(Theory of Mind, ToM)——即推断和推理他人心理状态的能力——是人类社会智能的基础。随着大语言模型(Large Language Models, LLMs)日益融入日常生活,评估并提升其理解和响应人类心理状态的能力变得至关重要。本文通过考察评估基准和改进策略,系统梳理了LLMs的心理理论能力。我们重点关注广泛采用的故事型基准测试,并对提升LLMs心理理论能力的方法进行了深入分析。此外,基于最新基准测试和最先进方法,我们提出了未来具有前景的研究方向。本综述为致力于推进LLMs心理理论能力的研究者提供了重要参考。
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Abstract
arXiv:2505.00028v1 Announce Type: cross Abstract: In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
摘要
近年来,端到端语音对话系统因其较传统级联系统的优势而获得越来越多研究关注,这些优势包括实现更低延迟以及更自然地整合情感和说话人身份等非语言信息。然而,这些端到端系统面临关键挑战,特别是在融入外部知识方面——这一能力通常由基于文本的大语言模型中的检索增强生成技术实现。核心难点在于输入语音与检索文本知识之间的模态差异阻碍了有效整合。为解决该问题,我们提出一种新颖的端到端检索增强生成框架,可直接从语音查询中检索相关文本知识,无需通过自动语音识别等技术进行中间语音转文本转换。实验结果表明,我们的方法显著提升了端到端语音对话系统性能,同时实现了更高检索效率。虽然整体性能仍落后于级联模型,但该框架为增强端到端语音对话系统的知识整合提供了可行方向。我们将公开代码和数据集以支持结果复现,并推动该领域的进一步研究。
Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting
Abstract
arXiv:2505.00029v1 Announce Type: cross Abstract: Large Vision Language Models have demonstrated impressive, versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
摘要
大规模视觉语言模型通过广泛的多模态预训练展现出卓越的通用能力,但在融入训练分布之外的专业知识领域时面临显著局限。这些模型存在一个根本性困境:直接注入领域知识的适应方法往往会引发基础视觉-语言能力的灾难性遗忘。我们提出结构化对话微调(SDFT),该方法能有效注入领域知识,同时最大限度减少灾难性遗忘。受大语言模型监督微调和文本到图像扩散模型中主体驱动个性化的启发,我们的方法采用三阶段对话结构:基础保持阶段通过描述任务强化预训练的视觉-语言对齐;对比消歧阶段引入精心设计的反事实样本以维持语义边界;知识专业化阶段通过思维链推理嵌入专业知识。跨多个领域的实验结果证实SDFT在平衡专业知识获取与通用能力保留方面的有效性。我们的核心贡献包括:平衡基础对齐与目标知识整合的数据中心化对话模板、加权多轮监督框架,以及针对多样化知识类型的全面评估。
Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving
Abstract
arXiv:2505.00031v1 Announce Type: cross Abstract: In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well demonstrated. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.
摘要
在大语言模型(LLM)后训练领域,利用模型自身生成的合成数据已被证明具有显著效果。然而,一个关键问题尚未得到解决:这类自生成数据应当包含哪些核心信息?现有方法仅能生成逐步的问题解决方案,却未能捕捉到跨相似问题泛化所需的抽象元知识。受认知科学启发——人类在深入细节前会运用高层抽象来简化复杂问题——我们提出一种新型自训练算法:作答前学习规划(LEPA)。该算法训练大语言模型在应对问题复杂性之前,先构建预期规划作为问题解决的抽象元知识。这种方法不仅勾勒出解决方案的生成路径,还能使模型免受无关细节干扰。在数据生成阶段,LEPA首先基于问题创建预期规划,随后生成与该规划和问题均匹配的解决方案;通过自我反思机制优化规划,旨在获得对生成正确解决方案具有指导价值的规划。在模型优化阶段,大语言模型被训练同时预测优化后的规划及其对应解决方案。通过高效提取和运用预期规划,LEPA在多项具有挑战性的自然语言推理基准测试中展现出超越传统算法的显著优势。
MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis
Abstract
arXiv:2505.00032v1 Announce Type: cross Abstract: Major depressive disorder (MDD) impacts more than 300 million people worldwide, making it a significant public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods have resulted in inadequate attention to this disorder in numerous countries and regions. This paper introduces a high-performance MDD diagnosis tool named MDD-LLM, an AI-driven framework that utilizes fine-tuned large language models (LLMs) and extensive real-world samples to tackle challenges in MDD diagnosis. Specifically, we select 274,348 individual records from the UK Biobank cohort and design a tabular data transformation method to create a large corpus for training and evaluating the proposed approach. To illustrate the advantages of MDD-LLM, we perform comprehensive experiments and provide several comparative analyses against existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799 - 0.9040), significantly outperforming existing machine learning and deep learning frameworks for MDD diagnosis. Given the limited exploration of LLMs in MDD diagnosis, we examine numerous factors that may influence the performance of our proposed method, such as tabular data transformation techniques and different fine-tuning strategies.
摘要
重度抑郁症(MDD)影响着全球超过3亿人口,已成为重大公共卫生问题。然而,医疗资源分配不均与诊断方法复杂性导致该疾病在许多国家和地区未能获得充分关注。本文提出一种名为MDD-LLM的高性能诊断工具,该人工智能驱动框架通过微调大语言模型(LLMs)并结合大规模真实世界样本,以解决MDD诊断中的挑战。为此,我们从英国生物银行队列中筛选274,348条个体信息用于方法训练与评估。具体而言,我们设计了一种表格数据转换方法,构建大规模语料库以支持所提方案的训练与验证。为展示MDD-LLM优势,我们开展全面实验,并在多维度评估指标下与现有模型解决方案进行对比分析。实验结果表明,MDD-LLM(70B)取得0.8378的准确率与0.8919的AUC值(95%置信区间:0.8799-0.9040),显著优于现有机器学习与深度学习诊断框架。鉴于LLMs在MDD诊断领域研究尚属有限,我们深入探究了可能影响方法性能的多重因素,包括表格数据转换技术与不同微调策略等。
Improving Phishing Email Detection Performance of Small Large Language Models
Abstract
arXiv:2505.00034v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable performance on many natural language processing (NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly on the phishing email detection task. To address this issue, we designed a set of methods including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble to improve the phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving accuracy on the SpamAssassin dataset from around 0.5 for baseline models like Qwen2.5-1.5B-Instruct to 0.976.
摘要
大型语言模型(LLMs)在众多自然语言处理(NLP)任务中展现出卓越性能,并已被应用于钓鱼邮件检测研究。然而,当前研究中表现优异的LLMs通常包含数十亿甚至数百亿参数,需要巨大的计算资源。为降低计算成本,我们探究了小参数LLMs在钓鱼邮件检测中的有效性。这些LLMs约含30亿参数,可在消费级GPU上运行。但小规模LLMs在钓鱼邮件检测任务中往往表现不佳。针对这一问题,我们设计了一套方法,包括提示工程、解释增强微调及模型集成,以提升小规模LLMs的钓鱼邮件检测能力。通过实验验证,我们的方法显著提高了模型性能,在SpamAssassin数据集上的准确率从Qwen2.5-1.5B-Instruct等基线模型的约0.5提升至0.976。
Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5
Abstract
arXiv:2505.00060v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown promise in enabling natural language interfaces for structured data querying through text-to-SQL generation. However, their application in real-world Business Intelligence (BI) contexts remains limited due to semantic hallucinations, structural errors, and a lack of domain-specific evaluation frameworks. In this study, we propose a Fact-Consistency Evaluation Framework for assessing the semantic accuracy of LLM-generated SQL outputs using Exaone 3.5, an instruction-tuned, bilingual LLM optimized for enterprise tasks. We construct a domain-specific benchmark comprising 219 natural language business questions across five SQL complexity levels, derived from actual sales data in LG Electronics' internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. We evaluate model performance using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that while Exaone 3.5 performs well on simple aggregation tasks (93% accuracy in L1), it exhibits substantial degradation in arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% in H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering, and incorrect grouping operations. Our findings highlight the current limitations of LLMs in business-critical environments and underscore the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.
摘要
大型语言模型(LLMs)在通过文本到SQL生成实现结构化数据查询的自然语言接口方面展现出潜力。然而,由于语义幻觉、结构错误以及缺乏领域特定评估框架,其在真实商业智能(BI)场景中的应用仍受限。本研究提出一个事实一致性评估框架,利用专为企业任务优化的指令微调双语模型Exaone 3.5,用于评估LLM生成SQL输出的语义准确性。我们构建了一个领域特定基准测试,包含219个跨五个SQL复杂度级别的自然语言业务问题,数据源自LG电子内部BigQuery环境中的实际销售记录。每个问题均配有黄金标准SQL查询和经过验证的基准答案。通过答案准确率、执行成功率、语义错误率和无响应率等指标评估模型性能。实验结果表明,Exaone 3.5在简单聚合任务表现良好(L1级93%准确率),但在算术推理(H1级4%准确率)和分组排序任务(H4级31%)中性能显著下降,语义错误和无响应主要集中在复杂案例中。定性错误分析进一步识别出常见失败类型,如算术逻辑误用、过滤条件不完整和分组操作错误。本研究揭示了LLM在业务关键环境中的当前局限性,强调需要事实一致性验证层和混合推理方法。本工作贡献了可复现的基准测试和评估方法,以推进面向企业结构化数据系统的可靠自然语言接口发展。
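The four metrics listed in the abstract above (answer accuracy, execution success, semantic error, non-response) suggest an execution-based check: run the generated query and the gold query, then compare results. The sketch below is one way such a check could look using Python's built-in `sqlite3`; the `sales` table, the example queries, and the labelling rules in `classify` are assumptions made for illustration and are not taken from the paper, which uses a BigQuery environment.

```python
import sqlite3


def run_query(conn: sqlite3.Connection, sql: str):
    """Return the result rows in sorted order, or None if the query fails to execute."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def classify(conn: sqlite3.Connection, generated_sql: str, gold_sql: str) -> str:
    """Label one generated query against its gold-standard query (hypothetical rules)."""
    if not generated_sql.strip():
        return "non_response"
    got = run_query(conn, generated_sql)
    if got is None:
        return "execution_error"
    # Executes, so compare against the gold query's result set.
    return "correct" if got == run_query(conn, gold_sql) else "semantic_error"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("KR", 10.0), ("KR", 5.0), ("US", 7.0)],
)

gold = "SELECT region, SUM(amount) FROM sales GROUP BY region"
labels = [
    classify(conn, "SELECT region, SUM(amount) FROM sales GROUP BY region", gold),
    classify(conn, "SELECT region, MAX(amount) FROM sales GROUP BY region", gold),  # wrong aggregate
    classify(conn, "SELECT region, SUM(amt) FROM sales GROUP BY region", gold),     # unknown column
    classify(conn, "", gold),                                                       # model returned nothing
]
```

Comparing result sets rather than SQL strings lets syntactically different but equivalent queries count as correct, while a query that runs but returns the wrong numbers is flagged as a semantic error.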
CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios
Abstract
arXiv:2505.00091v1 Announce Type: cross Abstract: With the increasing demand for heterogeneous Unmanned Aerial Vehicle (UAV) swarms to perform complex tasks in urban environments, system design now faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to dynamically adjust coordination strategies in response to evolving environmental conditions and continuously changing task requirements. To address the limitations of existing approaches, this paper proposes a coordination field agentic system for coordinating heterogeneous UAV swarms in complex urban scenarios. In this system, large language models (LLMs) are responsible for interpreting high-level human instructions and converting them into executable commands for the UAV swarms, such as patrol and target tracking. Subsequently, a coordination field mechanism is proposed to guide UAV motion and task selection, enabling decentralized and adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing were conducted across different models in a 2D simulation space to evaluate their performance. Experimental results demonstrate that the proposed system achieves superior performance in terms of task coverage, response time, and adaptability to dynamic changes.
摘要
随着城市环境中执行复杂任务的异构无人机群需求日益增长,系统设计面临重大挑战,包括高效语义理解、灵活任务规划以及根据环境条件演变和任务需求持续变化动态调整协调策略的能力。针对现有方法的局限性,本文提出一种用于复杂城市场景下异构无人机群协调的协调场代理系统。该系统采用大语言模型(LLMs)负责解析高层级人类指令并将其转化为可执行的无人机群指令(如巡逻与目标追踪),继而提出协调场机制来引导无人机运动与任务选择,实现突发任务的去中心化自适应分配。研究在二维仿真空间中对不同模型进行了共计50轮对比测试以评估其性能。实验结果表明,所提系统在任务覆盖率、响应时间及动态变化适应性方面均表现出优越性能。
Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques
Abstract
arXiv:2505.00105v1 Announce Type: cross Abstract: Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate two complementary optimization strategies on the MTEB benchmark: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8); and dimensionality reduction, assessing methods such as PCA, Kernel PCA, UMAP, Random Projections, and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (<0.3%), significantly outperforming int8 quantization at the same compression level while being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% of dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the configuration that maximizes performance within a given memory constraint.
摘要
检索增强生成技术通过从外部知识库检索相关信息来增强语言模型,其依赖于通常以float32精度存储的高维向量嵌入。然而,大规模存储这些嵌入向量会带来显著的内存挑战。为解决这一问题,我们在MTEB基准上系统研究了两种互补的优化策略:量化(评估标准格式如float16、int8、二值化及低比特浮点类型float8)和降维(评估PCA、核PCA、UMAP、随机投影及自编码器等方法)。实验结果表明,float8量化能以最小性能损失(<0.3%)实现4倍存储压缩,显著优于同等压缩级别的int8量化,且实现更简单。PCA被证明是最有效的降维技术。关键发现是,适度PCA(如保留50%维度)与float8量化的组合能提供最佳平衡,在实现8倍总压缩率的同时,其性能影响甚至小于单独使用int8(仅提供4倍压缩)。为促进实际应用,我们提出基于性能-存储权衡空间可视化的方法论,用于识别特定内存约束下能最大化性能的最优配置方案。
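The 8x figure in the abstract above follows from simple byte arithmetic: halving the dimensions gives 2x, and going from 4-byte float32 to a 1-byte code gives another 4x. The sketch below reproduces that arithmetic with NumPy. One important assumption: NumPy has no built-in float8 type, so a symmetric int8 code stands in for the 1-byte format purely to show the storage sizes; it does not reproduce the paper's float8 results. The function names and the random data are likewise hypothetical.

```python
import numpy as np


def pca_reduce(x: np.ndarray, keep: int) -> np.ndarray:
    """Project the rows of x onto their top-`keep` principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:keep].T


def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-matrix int8 quantization (1-byte stand-in for float8)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64)).astype(np.float32)  # synthetic vectors

reduced = pca_reduce(embeddings, keep=32)   # retain 50% of the dimensions
codes, scale = quantize_int8(reduced)       # 4 bytes -> 1 byte per value

compression = embeddings.nbytes / codes.nbytes
print(f"float32: {embeddings.nbytes} B, compressed: {codes.nbytes} B, ratio: {compression:.1f}x")
```

Approximate vectors can be recovered as `codes * scale`; retrieval quality after such compression would have to be measured on a real benchmark, which this sketch does not attempt.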
Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese
Abstract
arXiv:2505.00114v1 Announce Type: cross Abstract: This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning approaches (basic, contrastive, and grammar-hint tuning) using open-source Aya23 models. Experiments reveal that models fine-tuned on a smaller but culturally aware Lebanese dataset (LW) consistently outperform those trained on larger, non-native data. The best results were achieved through contrastive fine-tuning paired with contrastive prompting, which indicates the benefits of exposing translation models to bad examples. In addition, to ensure authentic evaluation, we introduce LebEval, a new benchmark derived from native Lebanese content, and compare it to the existing FLoRes benchmark. Our findings challenge the "More Data is Better" paradigm and emphasize the crucial role of cultural authenticity in dialectal translation. Our datasets and code are available on GitHub.
摘要
本文研究了大规模语言模型(LLMs)在翻译低资源黎巴嫩方言时的有效性,重点关注文化真实性数据与大规模翻译数据集的影响。我们使用开源Aya23模型比较了三种微调方法:基础微调、对比微调和语法提示微调。实验表明,在较小但具有文化敏感性的黎巴嫩数据集(LW)上微调的模型,其表现始终优于基于大规模非本土数据训练的模型。最佳结果通过对比微调结合对比提示实现,这表明让翻译模型接触负面示例具有积极效果。此外,为确保评估真实性,我们引入了LebEval——一个源自本土黎巴嫩内容的新基准,并与现有FLoRes基准进行比较。研究发现挑战了'数据越多越好'的范式,强调了文化真实性在方言翻译中的关键作用。相关数据集和代码已发布于Github平台。
Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
Abstract
arXiv:2505.00127v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.
摘要
大型语言模型(LLMs)正日益针对长推理进行优化,其假设是更长的推理会带来更好的性能。然而,新出现的证据表明,更长的回答有时反而会降低准确性而非提升。本文通过系统性的实证研究,探讨了推理长度与答案正确性之间的关系。研究发现,LLMs倾向于对简单问题过度思考,生成不必要的冗长输出,而对困难问题则思考不足,在最需要扩展推理时未能充分延伸。这表明模型可能误判问题难度,未能恰当地校准其回答长度。此外,我们通过偏好优化算法研究了长度缩减的效果——即在忽略答案正确性的情况下单纯偏好较短回答。实验表明,生成长度可显著缩短,同时保持可接受的准确性。我们的发现揭示了生成长度作为推理行为的重要信号,并为进一步探索LLMs在推理长度自适应中的自我意识提供了研究动机。
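The length-reduction experiment described in the abstract above prefers the shorter response regardless of correctness. A minimal sketch of how such preference pairs might be built for a preference optimization algorithm is shown below. The function name, the dictionary keys, and the use of whitespace-separated word count as the length measure are assumptions for illustration; the paper's actual tokenization and pairing scheme are not specified here.

```python
from itertools import combinations


def length_preference_pairs(prompt: str, responses: list[str]) -> list[dict]:
    """Build preference records in which the shorter response is always chosen.

    Correctness is deliberately ignored; length is measured in whitespace-separated words.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        len_a, len_b = len(a.split()), len(b.split())
        if len_a == len_b:
            continue  # equal length carries no preference signal
        chosen, rejected = (a, b) if len_a < len_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


samples = [
    "The answer is 4.",
    "First add two and two, which gives four, so the answer is 4.",
    "Two plus two: count up from two twice and you reach 4.",
]
pairs = length_preference_pairs("What is 2 + 2?", samples)
```

Here the first sample (four words) is chosen over each of the longer ones, and the pair of longer samples is ranked by their word counts in the same way.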
Audo-Sight: Enabling Ambient Interaction For Blind And Visually Impaired Individuals
Abstract
arXiv:2505.00153v1 Announce Type: cross Abstract: Visually impaired people face significant challenges when attempting to interact with and understand complex environments, and traditional assistive technologies often struggle to quickly provide necessary contextual understanding and interactive intelligence. This thesis presents Audo-Sight, a state-of-the-art assistive system that seamlessly integrates Multimodal Large Language Models (MLLMs) to provide expedient, context-aware interactions for Blind and Visually Impaired (BVI) individuals. The system operates in two different modalities: personalized interaction through user identification and public access in common spaces like museums and shopping malls. In tailored environments, the system adjusts its output to conform to the preferences of individual users, thus enhancing accessibility through a user-aware form of interaction. In shared environments, Audo-Sight employs a shared architecture that adapts to its current user with no manual reconfiguration required. To facilitate appropriate interactions with the LLM, the public Audo-Sight solution includes an Age-Range Determiner and Safe Query Filter. Additionally, the system ensures that responses are respectful to BVI users through NeMo Guardrails. By utilizing multimodal reasoning, BVI-cognizant response editing, and safeguarding features, this work represents a major leap in AI-driven accessibility technology capable of increasing autonomy, safety, and interaction for people with visual impairments in social settings. Finally, we present the integration of Audo-Sight and SmartSight, which enables enhanced situational awareness for BVI individuals. This integration takes advantage of the real-time visual analysis of SmartSight, combined with the extensive reasoning and interactive capabilities of Audo-Sight, and goes beyond object identification to provide context-driven, voice-controlled assistance in dynamic environments.
摘要
视障人士在与复杂环境交互和理解时面临重大挑战,传统辅助技术往往难以快速提供必要的上下文理解和交互智能。本研究提出Audo-Sight——一种集成多模态大语言模型(MLLM)的先进辅助系统,可为盲人及视障群体(BVI)提供便捷的情境感知交互。该系统具有两种运行模式:通过用户识别的个性化交互,以及在博物馆、购物中心等公共空间的通用访问。在定制环境中,系统会根据个体偏好调整输出,从而通过用户感知型交互提升可访问性;在共享环境中,Audo-Sight采用自适应架构,无需手动配置即可适应当前用户。为实现与大语言模型的恰当交互,公共版Audo-Sight解决方案包含年龄范围判定器和安全查询过滤器,并通过NeMo Guardrails确保响应内容对视障用户的尊重性。通过多模态推理、视障认知响应编辑及安全防护功能,该研究实现了AI驱动辅助技术的重大突破,能有效提升视障人士在社交场景中的自主性、安全性和交互能力。最后,我们展示了Audo-Sight与SmartSight的集成方案,该方案利用SmartSight的实时视觉分析能力,结合Audo-Sight强大的推理与交互功能,可超越基础物体识别,在动态环境中提供情境驱动的语音控制辅助,从而显著增强视障人士的环境感知能力。
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Abstract
arXiv:2505.00190v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive (Lieberum et al., 2024), especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. We compare progressive coding based on subset pruning of SAEs with jointly training nested SAEs, so-called Matryoshka SAEs (Bussmann et al., 2024; Nabeshima, 2024), on a language modeling task. We show that Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs, however, are more interpretable. We discuss the origins and implications of this trade-off.
摘要
稀疏自编码器(SAEs)通过字典学习以无监督方式大规模提取神经网络中的可解释特征,应用于表征工程与信息检索领域。然而SAEs存在计算成本高昂的问题,当需要不同规模的多个SAEs时尤为突出。本研究发现传统SAEs的字典重要性遵循幂律分布。我们在语言建模任务中比较了基于SAEs子集剪枝的渐进编码方法与联合训练嵌套SAEs(即"套娃"SAEs)的性能。实验表明套娃SAEs具有更低的重构损失与语言建模损失恢复值,以及更高的表征相似度;但剪枝后的传统SAEs可解释性更优。本文深入探讨了这种权衡关系的成因及其影响。
Scaling On-Device GPU Inference for Large Generative Models
Abstract
arXiv:2505.00232v1 Announce Type: cross Abstract: Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift, an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads that contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
摘要
在生成式人工智能发展的推动下,大型机器学习模型已彻底改变了图像处理、音频合成和语音识别等领域。虽然基于服务器的部署仍保持峰值性能,但出于隐私和效率考量,设备端推理的需求持续存在。鉴于GPU是目前应用最广泛的设备端机器学习加速器,我们提出ML Drift——一个优化的框架,它扩展了最先进的GPU加速推理引擎的能力。ML Drift能够在设备端执行生成式AI工作负载,其参数量达到现有设备端生成式AI模型的10至100倍。该框架解决了跨GPU API开发相关的复杂工程挑战,并确保在移动和桌面/笔记本平台上的广泛兼容性,从而促进在资源受限设备上部署更复杂的模型。我们的GPU加速机器学习/人工智能推理引擎相较于现有开源GPU推理引擎,实现了数量级的性能提升。
LLM-Based Threat Detection and Prevention Framework for IoT Ecosystems
Abstract
arXiv:2505.00240v1 Announce Type: cross Abstract: The increasing complexity and scale of the Internet of Things (IoT) have made security a critical concern. This paper presents a novel Large Language Model (LLM)-based framework for comprehensive threat detection and prevention in IoT environments. The system integrates lightweight LLMs fine-tuned on IoT-specific datasets (IoT-23, TON_IoT) for real-time anomaly detection and automated, context-aware mitigation strategies optimized for resource-constrained devices. A modular Docker-based deployment enables scalable and reproducible evaluation across diverse network conditions. Experimental results in simulated IoT environments demonstrate significant improvements in detection accuracy, response latency, and resource efficiency over traditional security methods. The proposed framework highlights the potential of LLM-driven, autonomous security solutions for future IoT ecosystems.
摘要
随着物联网(IoT)复杂性和规模的日益增长,安全性已成为关键问题。本文提出了一种基于大型语言模型(LLM)的新型框架,用于物联网环境中的全面威胁检测与防护。该系统集成了针对物联网专用数据集(IoT-23、TON_IoT)进行微调的轻量级LLM,可实现实时异常检测,并部署专为资源受限设备优化的自动化情境感知缓解策略。基于Docker的模块化部署方案支持在不同网络条件下进行可扩展、可复现的评估。模拟物联网环境中的实验结果表明,相较于传统安全方法,该框架在检测精度、响应延迟和资源效率方面均有显著提升。所提出的框架凸显了LLM驱动的自主安全解决方案在未来物联网生态系统中的应用潜力。
Consistency in Language Models: Current Landscape, Challenges, and Future Directions
Abstract
arXiv:2505.00268v1 Announce Type: cross Abstract: The hallmark of effective language use lies in consistency: expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models struggle to maintain reliable consistency across different scenarios. This paper examines the landscape of consistency research in AI language systems, exploring both formal consistency (including logical rule adherence) and informal consistency (such as moral and factual coherence). We analyze current approaches to measuring aspects of consistency and identify critical research gaps in standardization of definitions, multilingual assessment, and methods to improve consistency. Our findings point to an urgent need for robust benchmarks to measure consistency, and for interdisciplinary approaches to ensure it when language models are applied to domain-specific tasks, while preserving utility and adaptability.
摘要
有效语言运用的核心在于一致性——在相似语境中表达相近含义并避免矛盾。虽然人类交流天然遵循这一原则,但最先进的语言模型仍难以在不同情境下保持可靠的一致性。本文系统考察了人工智能语言系统的一致性研究现状,既探讨了形式一致性(包括逻辑规则遵循),也分析了非形式一致性(如道德与事实连贯性)。我们评估了当前衡量一致性各个维度的研究方法,指出了定义标准化、多语言评估以及提升一致性方法等领域的关键研究空白。研究结果表明,亟需建立强有力的基准测试来衡量语言模型的一致性,并采用跨学科方法以确保其在领域特定任务应用中保持一致性,同时不损害实用性与适应性。
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
Abstract
arXiv:2505.00347v1 Announce Type: cross Abstract: The explosion in model sizes leads to continued growth in prohibitive training/fine-tuning costs, particularly for stateful optimizers, which maintain auxiliary information of up to 2x the model size to achieve optimal convergence. We therefore present in this work a novel type of optimizer that carries an extremely lightweight state overhead, achieved through ultra-low-precision quantization. While previous efforts have achieved certain success with 8-bit or 4-bit quantization, our approach enables optimizers to operate at precision as low as 3 bits, or even 2 bits, per state element. This is accomplished by identifying and addressing two critical challenges: the signal swamping problem in unsigned quantization that results in unchanged state dynamics, and the rapidly increased gradient variance in signed quantization that leads to incorrect descent directions. The theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum value for the latter. Consequently, the proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss. We hope that SOLO can contribute to overcoming the bottleneck in computational resources, thereby promoting greater accessibility in fundamental research.
摘要
模型规模的爆炸式增长导致训练/微调成本持续攀升,尤其对于需维护相当于模型尺寸2倍辅助信息以实现最优收敛的状态优化器而言。为此,我们提出一种新型优化器,通过超低精度量化技术实现极轻量级状态负载。虽然先前研究已在8位或4位量化方面取得一定成果,但本方案能使优化器以低至3位甚至每位状态元素2位的精度运行。这通过识别并解决两个关键挑战实现:无符号量化中导致状态动态不变的信号淹没问题,以及有符号量化中因梯度方差快速增大引发的错误下降方向问题。理论分析表明,前者需采用定制化对数量化方案,后者则需要精度特定的动量值。实验证明,所提出的SOLO方法在实现显著内存节省(训练7B模型时约45GB)的同时仅产生极小精度损失。我们期待SOLO能助力突破计算资源瓶颈,从而提升基础研究的可及性。
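The signal swamping problem mentioned in the abstract above can be reproduced with a toy exponential moving average (EMA). In the sketch below, the state is rounded to a uniform 2-bit grid after every update; with a decay close to 1, each increment is smaller than half a grid step, so the rounded state never moves. The uniform grid, the round-to-nearest rule, and all numeric values are assumptions chosen for illustration; this is not the SOLO algorithm and does not implement the logarithmic quantization the paper proposes.

```python
import numpy as np


def quantize_uniform(v: float, bits: int, vmax: float = 1.0) -> float:
    """Round v to the nearest point of a uniform unsigned grid on [0, vmax]."""
    levels = 2 ** bits - 1
    return float(np.round(np.clip(v, 0.0, vmax) / vmax * levels) / levels * vmax)


def ema_steps(v0: float, g2: float, beta: float, steps: int, bits=None) -> float:
    """Run v <- beta * v + (1 - beta) * g2, optionally re-quantizing the state each step."""
    v = v0
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * g2
        if bits is not None:
            v = quantize_uniform(v, bits)
    return v


# Full-precision state drifts from 1/3 towards the constant input 1.0.
full = ema_steps(v0=1 / 3, g2=1.0, beta=0.999, steps=5000)
# The 2-bit state receives increments far below half a grid step (1/6) and stays put.
low = ema_steps(v0=1 / 3, g2=1.0, beta=0.999, steps=5000, bits=2)
print(f"full precision: {full:.4f}, 2-bit state: {low:.4f}")
```

The quantized state remaining at its initial grid point while the full-precision state approaches 1.0 is the "unchanged state dynamics" the abstract refers to.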
Optimizing Deep Neural Networks using Safety-Guided Self Compression
Abstract
arXiv:2505.00350v1 Announce Type: cross Abstract: The deployment of deep neural networks on resource-constrained devices necessitates effective model compression strategies that judiciously balance the reduction of model size with the preservation of performance. This study introduces a novel safety-driven quantization framework that leverages preservation sets to systematically prune and quantize neural network weights, thereby optimizing model complexity without compromising accuracy. The proposed methodology is rigorously evaluated on both a convolutional neural network (CNN) and an attention-based language model, demonstrating its applicability across diverse architectural paradigms. Experimental results reveal that our framework achieves up to a 2.5% enhancement in test accuracy relative to the original unquantized models while maintaining 60% of the initial model size. In comparison to conventional quantization techniques, our approach not only augments generalization by eliminating parameter noise and retaining essential weights but also reduces variance, thereby ensuring the retention of critical model features. These findings underscore the efficacy of safety-driven quantization as a robust and reliable strategy for the efficient optimization of deep learning models. The implementation and comprehensive experimental evaluations of our framework are publicly accessible on GitHub.
摘要
在资源受限设备上部署深度神经网络需要有效的模型压缩策略,以明智地平衡模型规模缩减与性能保持之间的关系。本研究提出了一种新颖的安全驱动量化框架,该框架利用保护集对神经网络权重进行系统化剪枝和量化,从而在不牺牲精度的情况下优化模型复杂度。所提方法在卷积神经网络(CNN)和基于注意力的语言模型上均进行了严格评估,证明了其在不同架构范式中的适用性。实验结果表明,本框架在保持初始模型规模60%的同时,测试准确率较原始未量化模型最高可提升2.5%。与传统量化技术相比,该方法通过消除参数噪声并保留关键权重,不仅增强了泛化能力,还降低了方差,从而确保模型关键特征的保留。这些发现印证了安全驱动量化作为一种鲁棒可靠的策略,能有效优化深度学习模型。本框架的实现及完整实验评估已在GitHub上公开。
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Abstract
arXiv:2505.00358v1 Announce Type: cross Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
摘要
数据混合策略已成功降低了训练语言模型的成本。尽管前景广阔,此类方法仍存在两个缺陷:首先,它们依赖预定义的数据域(如数据源、任务类型),可能无法捕捉关键的语义细微差别,导致性能未能充分发挥;其次,这些方法的计算复杂度随数据域数量呈指数级增长。我们通过R&B框架解决这些问题——该框架基于语义相似性(重组)重新划分训练数据以创建更细粒度的域,并利用训练过程中获得的域梯度所诱导的Gram矩阵高效优化数据组合(平衡)。与现有工作不同,它无需额外计算评估信息(如损失或梯度)。我们在标准正则条件下对该技术进行分析,并从理论上证明了R&B相较于非自适应混合方法的有效性。实证方面,我们在涵盖自然语言、推理及多模态任务的五个多样化数据集上验证了R&B的效能。仅需0.01%的额外计算开销,R&B即可达到或超越最先进数据混合策略的性能表现。
KoACD: The First Korean Adolescent Dataset for Cognitive Distortion Analysis
Abstract
arXiv:2505.00367v1 Announce Type: cross Abstract: Cognitive distortion refers to negative thinking patterns that can lead to mental health issues like depression and anxiety in adolescents. Previous studies using natural language processing (NLP) have focused mainly on small-scale adult datasets, with limited research on adolescents. This study introduces KoACD, the first large-scale dataset of cognitive distortions in Korean adolescents, containing 108,717 instances. We applied a multi-Large Language Model (LLM) negotiation method to refine distortion classification and generate synthetic data using two approaches: cognitive clarification for textual clarity and cognitive balancing for diverse distortion representation. Validation through LLMs and expert evaluations showed that while LLMs classified distortions with explicit markers, they struggled with context-dependent reasoning, where human evaluators demonstrated higher accuracy. KoACD aims to enhance future research on cognitive distortion detection.
摘要
认知扭曲是指可能导致青少年抑郁和焦虑等心理健康问题的消极思维模式。既往采用自然语言处理(NLP)的研究主要集中于小规模成人数据集,针对青少年的研究较为有限。本研究首次构建了韩国青少年认知扭曲大规模数据集KoACD,包含108,717条实例。我们采用多大型语言模型(LLM)协商方法优化扭曲分类,并通过两种途径生成合成数据:用于文本清晰化的认知澄清法,以及用于多样化扭曲表征的认知平衡法。经LLM和专家评估验证发现,虽然LLM能有效识别具有显性标记的认知扭曲,但在依赖上下文推理时表现欠佳,而人类评估者则展现出更高准确性。KoACD数据集旨在为未来认知扭曲检测研究提供支持。
Data Therapist: Eliciting Domain Knowledge from Subject Matter Experts Using Large Language Models
Abstract
arXiv:2505.00455v1 Announce Type: cross Abstract: Effective data visualization requires not only technical proficiency but also a deep understanding of the domain-specific context in which data exists. This context often includes tacit knowledge about data provenance, quality, and intended use, which is rarely explicit in the dataset itself. We present the Data Therapist, a web-based tool that helps domain experts externalize this implicit knowledge through a mixed-initiative process combining iterative Q&A with interactive annotation. Powered by a large language model, the system analyzes user-supplied datasets, prompts users with targeted questions, and allows annotation at varying levels of granularity. The resulting structured knowledge base can inform both human and automated visualization design. We evaluated the tool in a qualitative study involving expert pairs from Molecular Biology, Accounting, Political Science, and Usable Security. The study revealed recurring patterns in how experts reason about their data and highlighted areas where AI support can improve visualization design.
摘要
有效的数据可视化不仅需要技术熟练度,还需深刻理解数据所处的特定领域背景。这一背景通常包含关于数据来源、质量及预期用途的隐性知识,而这些信息很少在数据集中明确体现。我们提出"数据治疗师"——一种基于网络的工具,通过结合迭代问答与交互式标注的混合主动流程,帮助领域专家外化这类隐含知识。该系统由大语言模型驱动,可分析用户提供的数据集、提出针对性问题,并支持不同粒度级别的标注。最终形成的结构化知识库可为人类和自动化可视化设计提供参考。我们在分子生物学、会计学、政治学及可用性安全四个领域的专家配对定性研究中评估了该工具。研究揭示了专家推理数据时的重复模式,并指出人工智能支持可优化可视化设计的重点领域。
Red Teaming Large Language Models for Healthcare
Abstract
arXiv:2505.00467v1 Announce Type: cross Abstract: We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities -- realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.
摘要
我们介绍了2024年8月15日举办的"医疗健康领域大语言模型红队测试"预会议研讨会(隶属于2024年医疗健康机器学习会议)的设计流程与研究成果。由计算科学与临床医学专家共同组成的会议参与者们,致力于发现大语言模型(LLM)存在的潜在漏洞——即那些可能导致临床危害的现实医疗场景提示词。通过与临床医师开展红队测试,能够识别出缺乏临床专业知识的LLM开发者可能忽略的模型缺陷。本研究系统报告了所发现的漏洞,对其进行了分类,并呈现了针对所有测试LLM的漏洞复现研究结果。
HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection
Abstract
arXiv:2505.00506v1 Announce Type: cross Abstract: As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content -- text that is not grounded in supporting evidence -- has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems -- both open and closed source -- highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.
摘要
随着大语言模型(LLMs)在高风险领域的广泛应用,检测幻觉内容——即缺乏证据支持的生成文本——已成为关键挑战。现有幻觉检测基准多为合成生成,仅聚焦于抽取式问答任务,难以捕捉多文档语境和完整句子输出的现实场景复杂性。本文提出HalluMix Benchmark,这是一个多样化、任务无关的数据集,涵盖多领域和多格式的样本。基于该基准,我们评估了七种开源与闭源的幻觉检测系统,揭示其在任务类型、文本长度和输入表征方面的性能差异。分析表明,长短文本语境间存在显著性能差距,这对现实中的检索增强生成(RAG)应用具有重要启示。其中Quotient Detections系统表现最优,准确率达0.82,F1分数为0.84。
Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks
Abstract
arXiv:2505.00530v1 Announce Type: cross Abstract: SMILES-based molecule generation has emerged as a powerful approach in drug discovery. Deep reinforcement learning (RL) using a large language model (LLM) has been incorporated into the molecule generation process to achieve high matching scores in terms of the likelihood of desired molecule candidates. However, a critical challenge in this approach is catastrophic forgetting during the RL phase, where knowledge such as molecule validity, which often exceeds 99% during pretraining, significantly deteriorates. Current RL algorithms applied in drug discovery, such as REINVENT, use prior models as anchors to retain pretraining knowledge, but these methods lack robust exploration mechanisms. To address these issues, we propose Partial SMILES Validation-PPO (PSV-PPO), a novel RL algorithm that incorporates real-time partial SMILES validation to prevent catastrophic forgetting while encouraging exploration. Unlike traditional RL approaches that validate molecule structures only after generating entire sequences, PSV-PPO performs stepwise validation at each auto-regressive step, evaluating not only the selected token candidate but also all potential branches stemming from the prior partial sequence. This enables early detection of invalid partial SMILES across all potential paths. As a result, PSV-PPO maintains high validity rates even during aggressive exploration of the vast chemical space. Our experiments on the PMO and GuacaMol benchmark datasets demonstrate that PSV-PPO significantly reduces the number of invalid generated structures while maintaining competitive exploration and optimization performance. While our work primarily focuses on maintaining validity, the framework of PSV-PPO can be extended in future research to incorporate additional forms of valuable domain knowledge, further enhancing reinforcement learning applications in drug discovery.
摘要
基于SMILES的分子生成已成为药物发现领域的重要方法。为获得理想候选分子高似然匹配分数,研究者将基于大语言模型(LLM)的深度强化学习(RL)引入分子生成过程。然而该方法存在关键挑战:RL阶段会出现灾难性遗忘现象——预训练阶段通常超过99%的分子有效性等知识会显著退化。当前药物发现中应用的RL算法(如REINVENT)采用先验模型作为锚点来保留预训练知识,但这些方法缺乏稳健的探索机制。为解决这些问题,我们提出部分SMILES验证PPO算法(PSV-PPO),这种新型RL算法通过实时部分SMILES验证来防止灾难性遗忘并促进探索。与传统RL方法仅在生成完整序列后验证分子结构不同,PSV-PPO在每一步自回归生成时进行逐步验证,不仅评估所选token候选,还评估源自先前部分序列的所有潜在分支。这实现了对所有潜在路径中无效部分SMILES的早期检测。因此,PSV-PPO即便在激进探索巨大化学空间时仍能保持高有效性。我们在PMO和GuacaMol基准数据集上的实验表明,PSV-PPO在保持竞争性探索和优化性能的同时,显著减少了无效结构的生成数量。虽然本研究主要聚焦有效性保持,但PSV-PPO框架可在未来研究中扩展至其他有价值的领域知识形式,从而进一步增强强化学习在药物发现中的应用。
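The idea of validating every candidate next token against the current partial sequence can be sketched with a toy prefix checker. This is an illustrative assumption, not the PSV-PPO validator: `is_valid_prefix` and `allowed_next_tokens` are hypothetical names, and only branch parentheses are checked, whereas a real validator would need a proper SMILES parser covering atoms, valence, and ring closures.

```python
def is_valid_prefix(smiles: str) -> bool:
    """Toy check: can this partial string still be completed with balanced branches?

    Only branch parentheses are tracked here (an assumption for brevity).
    """
    depth = 0
    prev = ""
    for ch in smiles:
        if ch == "(":
            if prev in ("", "("):           # a branch must follow an atom
                return False
            depth += 1
        elif ch == ")":
            if depth == 0 or prev == "(":   # unmatched close, or empty branch
                return False
            depth -= 1
        prev = ch
    return True

def allowed_next_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Keep only the candidate tokens that leave the prefix valid."""
    return [tok for tok in vocab if is_valid_prefix(prefix + tok)]

print(allowed_next_tokens("CC(", ["C", "(", ")"]))
print(allowed_next_tokens("CC", ["C", "(", ")"]))
```

In an RL loop, a mask of this kind would typically be applied to the policy's token distribution before sampling, so that invalid branches are detected at the step where they arise rather than after the full sequence is generated.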
Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models
Abstract
arXiv:2505.00557v1 Announce Type: cross Abstract: Hallucinations in large language models (LLMs) present a growing challenge across real-world applications, from healthcare to law, where factual reliability is essential. Despite advances in alignment and instruction tuning, LLMs can still generate outputs that are fluent yet fundamentally untrue. Understanding the cognitive dynamics that underlie these hallucinations remains an open problem. In this study, we propose a prompt-based framework to systematically trigger and quantify hallucination: a Hallucination-Inducing Prompt (HIP), which synthetically fuses semantically distant concepts (e.g., periodic table of elements and tarot divination) in a misleading way, and a Hallucination Quantifying Prompt (HQP), which scores the plausibility, confidence, and coherence of the output. Controlled experiments across multiple LLMs revealed that HIPs consistently produced less coherent and more hallucinated responses than their null-fusion controls. These effects varied across models, with reasoning-oriented LLMs showing distinct profiles from general-purpose ones. Our framework provides a reproducible testbed for studying hallucination vulnerability, and opens the door to developing safer, more introspective LLMs that can detect and self-regulate the onset of conceptual instability.
摘要
大型语言模型(LLMs)的幻觉现象在现实应用中构成了日益严峻的挑战,特别是在医疗、法律等需要事实可靠性的领域。尽管对齐和指令微调技术取得了进展,LLMs仍可能生成流畅但根本错误的输出。理解这些幻觉背后的认知机制仍是一个未解决的问题。本研究提出一种基于提示的框架来系统性地诱发和量化幻觉:幻觉诱导提示(HIP)通过误导性方式合成融合语义疏远的概念(如元素周期表与塔罗占卜),以及幻觉量化提示(HQP)对输出的合理性、置信度和连贯性进行评分。跨多个LLMs的对照实验表明,与零融合对照组相比,HIPs持续产生连贯性更低、幻觉更严重的响应。不同模型间存在效应差异,推理导向型LLMs表现出与通用模型不同的特征。该框架为研究幻觉脆弱性提供了可复现的测试平台,并为开发能够检测和自我调节概念不稳定性、更安全且更具自省能力的LLMs开辟了新途径。
FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension
Abstract
arXiv:2505.00570v1 Announce Type: cross Abstract: Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.
摘要
扩展大语言模型(LLMs)的上下文窗口对于涉及长文本生成的应用至关重要。然而,键值(KV)缓存内存需求的线性增长以及自注意力机制相对于序列长度的二次复杂度,在微调和推理过程中带来了显著挑战。现有方法在扩展到更长上下文时存在性能下降问题。本研究提出了一种新颖的上下文扩展方法,可同时优化微调和推理效率。我们的方法基于一个关键发现:在频域中,KV缓存的能量分布主要集中于低频分量。通过滤除高频分量,KV缓存能以最小信息损失实现有效压缩。基于此洞见,我们提出了一种高效压缩技术FreqKV,该技术可在频域中将不断增长的KV缓存迭代压缩至固定尺寸,适用于微调和推理场景。FreqKV无需引入额外参数或架构修改。通过极少量微调,LLMs即可学会利用频域压缩后的有限缓存,从而高效扩展上下文窗口。在多种长上下文语言建模与理解任务上的实验验证了该方法的效率与有效性。
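Compressing a sequence by keeping only its low-frequency components can be sketched in a few lines. This is a minimal single-channel illustration, assuming a discrete cosine transform along the sequence axis; the abstract does not name the transform or the retention ratio, so `dct`, `idct`, and `compress` are hypothetical helpers rather than the FreqKV implementation.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)))
    return out

def idct(coeffs):
    """Orthonormal inverse (DCT-III); returns len(coeffs) samples."""
    m = len(coeffs)
    out = []
    for j in range(m):
        out.append(sum(math.sqrt((1 if k == 0 else 2) / m) * coeffs[k]
                       * math.cos(math.pi * (j + 0.5) * k / m)
                       for k in range(m)))
    return out

def compress(x, m):
    """Shrink a length-n sequence to length m by dropping high frequencies."""
    n = len(x)
    low = dct(x)[:m]               # keep the m lowest-frequency coefficients
    gain = math.sqrt(m / n)        # rescale so amplitudes are preserved
    return [gain * v for v in idct(low)]

# One channel of a cache with 8 positions, compressed to 4 positions.
print(compress([2.0] * 8, 4))
```

A smooth signal survives the truncation almost unchanged, which is the property the abstract relies on when it says that energy is concentrated in low-frequency components.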
FineScope: Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation
Abstract
arXiv:2505.00624v1 Announce Type: cross Abstract: Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach. The code will be released.
摘要
训练大型语言模型(LLM)需要大量计算资源,这促使研究者关注开发既能保持高效性又具备强大任务性能的小型领域专用模型。诸如LLaMA等中等规模模型虽常作为领域适应的起点,但在专业数据集测试中往往存在精度下降问题。本文提出FineScope框架,旨在从大规模预训练模型中提取紧凑的领域优化模型。该框架基于稀疏自编码器(SAE)架构——其可解释特征表示能力为设计灵感,用于从海量数据中提取领域相关子集。我们采用带领域约束的结构化剪枝技术,确保剪枝后的模型保留目标领域核心知识。为进一步提升性能,剪枝模型通过自数据蒸馏过程,利用SAE筛选的数据集恢复剪枝过程中丢失的关键领域信息。大量实验与消融研究表明,FineScope在领域任务中展现出极具竞争力的性能,优于多个最先进的大规模LLM。此外,结果显示当采用SAE筛选数据集微调时,剪枝模型能恢复其原始性能的显著部分。值得注意的是,即使不对预训练模型进行剪枝,使用这些数据集进行微调同样能提升其领域准确率,这验证了本方法的鲁棒性。相关代码将予以开源。
The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)
Abstract
arXiv:2505.00626v1 Announce Type: cross Abstract: Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role -- a concept we call role separation -- is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
摘要
能够整合多种输入角色(如系统指令、用户查询、外部工具输出)的大语言模型(LLMs)在实践中日益普及。确保模型准确区分不同角色的消息——我们称之为角色分离——对于保持多角色行为的一致性至关重要。尽管近期研究多聚焦于最先进的提示注入防御,但此类方法是否真正教会了LLMs区分角色,还是仅仅记住了已知的触发模式,仍不明确。本文研究了角色分离学习:即教导LLMs稳健区分系统与用户标记的过程。通过一个简单可控的实验框架,我们发现微调后的模型通常依赖两种角色识别代理:(1)任务类型利用,以及(2)与文本起始位置的邻近性。虽然数据增强可以部分缓解这些捷径,但通常会导致迭代修补而非根本解决。为此,我们提出通过调整模型输入编码中的逐标记线索,强化标记角色边界的不变信号。特别是,操纵位置ID有助于模型学习更清晰的区分,并减少对表层代理的依赖。通过这种以机制为中心的视角,我们的工作阐明了LLMs如何更可靠地保持多角色行为的一致性,而不仅仅是记忆已知提示或触发模式。
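One way to picture "manipulating position IDs" is to leave a gap between the positions assigned to system tokens and those assigned to user tokens, so that the role boundary shows up as a discontinuity in the input encoding. The sketch below is an assumption for illustration only: the abstract does not specify the manipulation, and `position_ids` and the `gap` parameter are hypothetical.

```python
def position_ids(system_len: int, user_len: int, gap: int = 0) -> list[int]:
    """Assign position IDs to a [system tokens][user tokens] sequence.

    With gap=0 this is the usual contiguous numbering. A positive gap
    (hypothetical scheme) shifts every user token by a fixed offset.
    """
    system = list(range(system_len))
    start = system_len + gap
    user = list(range(start, start + user_len))
    return system + user

print(position_ids(3, 2))          # contiguous numbering
print(position_ids(3, 2, gap=10))  # user tokens start after a jump
```

Because the jump depends only on the role boundary and not on the task or on distance from the beginning of the text, it is the kind of invariant signal the abstract argues for, as opposed to the two shortcut proxies it identifies.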
Large Language Models Understanding: an Inherent Ambiguity Barrier
Abstract
arXiv:2505.00654v1 Announce Type: cross Abstract: Since the extraordinary emergence of Large Language Models (LLMs), a lively debate has been under way regarding their capability to understand the world and capture the meaning of the dialogues in which they are involved. Arguments and counter-arguments have been proposed based upon thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. In this brief paper we present a counter-argument based upon a thought experiment and semi-formal considerations leading to an inherent ambiguity barrier which prevents LLMs from having any understanding of what their amazingly fluent dialogues mean.
摘要
自大型语言模型(LLMs)以非凡姿态涌现以来,关于其理解世界及捕捉对话意义能力的激烈争论持续进行。支持与反对观点基于思想实验、LLMs与人类间的轶事对话、统计语言分析、哲学思辨等多种论据展开。本文通过思想实验与半形式化论证提出反驳观点,指出LLMs存在固有的模糊性障碍,这一障碍使其无法真正理解那些流畅对话的实际意义。
On the generalization of language models from in-context learning and finetuning: a controlled study
Abstract
arXiv:2505.00661v1 Announce Type: cross Abstract: Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning -- from failing to generalize to simple reversals of relations they are trained on, to missing logical deductions that can be made from trained information. These failures to generalize from fine-tuning can hinder practical application of these models. However, language models' in-context learning shows different inductive biases, and can generalize better in some of these cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' ability to generalize from finetuning data. The datasets are constructed to isolate the knowledge in the dataset from that in pretraining, to create clean tests of generalization. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.
摘要
大型语言模型展现出令人振奋的能力,但在微调后却可能表现出惊人的狭义泛化局限——从无法推广到训练关系的简单反转,到遗漏基于已训练信息可进行的逻辑推理。这些微调后的泛化失败可能阻碍模型的实际应用。然而,语言模型的上下文学习展现出不同的归纳偏好,在某些情况下能实现更好的泛化。本研究系统探究了基于上下文学习与基于微调学习在泛化能力上的差异。为此,我们构建了多个新型数据集来评估和提升模型从微调数据中泛化的能力。这些数据集的设计旨在将数据集中的知识与预训练知识相隔离,从而创建纯净的泛化测试环境。我们让预训练大模型接触这些数据集中受控信息子集——通过上下文或微调方式——并评估其在需要各类泛化的测试集上的表现。总体发现表明,在数据匹配条件下,上下文学习比微调能实现更灵活的泛化(尽管我们也发现对先前研究结果的若干限定条件,例如当微调能泛化至嵌入更大知识结构的反转关系时)。基于这些发现,我们提出了一种改进微调泛化的方法:在微调数据中添加上下文推理。实验证明该方法在我们数据集的不同划分及其他基准测试中均能提升泛化性能。本研究结果对理解语言模型不同学习模式的归纳偏好具有理论意义,同时为实际提升模型性能提供了可行方案。
DeepCritic: Deliberate Critique with Large Language Models
Abstract
arXiv:2505.00662v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and insufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
摘要
随着大语言模型(LLMs)的快速发展,如何对其输出提供准确反馈并实现可扩展的监督已成为亟待解决的关键问题。利用LLMs作为批判模型以实现自动化监督是一种颇具前景的解决方案。本研究重点探索并提升LLMs的数学批判能力。现有LLM批判模型提供的分步评析过于浅显,导致判断准确率低下,且难以为生成模型提供足够的纠错反馈。为解决这一问题,我们提出了一种新颖有效的两阶段框架,用于开发能够对数学解题过程的每个推理步骤进行细致批判的LLM评审模型。第一阶段,我们使用Qwen2.5-72B-Instruct生成4.5K份长式评析作为监督微调的种子数据,每份种子评析包含针对各推理步骤的多角度验证以及对初始评析的深度批判。随后,我们基于PRM800K的人类标注数据或通过蒙特卡洛采样正确性估计获得的自动标注数据,对微调后的模型进行强化学习,以进一步提升其批判能力。基于Qwen2.5-7B-Instruct构建的批判模型不仅在多个错误识别基准测试中显著优于现有LLM批判模型(包括同体量的DeepSeek-R1-distill模型和GPT-4o),还能通过更详尽的反馈更有效地辅助生成模型修正错误步骤。
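Monte Carlo sampling-based correctness estimation can be sketched as: for each prefix of a solution, sample several continuations and record how often they reach the right answer. The code below is a minimal sketch under stated assumptions; `rollout` is a hypothetical stand-in for a generator model plus answer checker, and the sample count and threshold are illustrative rather than taken from the paper.

```python
import random

def estimate_step_correctness(steps, rollout, n_samples=8, seed=0):
    """For each prefix steps[:i+1], return the fraction of sampled
    continuations that end in a correct final answer."""
    rng = random.Random(seed)
    scores = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        hits = sum(bool(rollout(prefix, rng)) for _ in range(n_samples))
        scores.append(hits / n_samples)
    return scores

def first_error_step(scores, threshold=0.0):
    """Index of the first step whose estimated correctness is at or
    below the threshold, or None if every step passes."""
    for i, s in enumerate(scores):
        if s <= threshold:
            return i
    return None

# Deterministic stand-in: a continuation succeeds unless the prefix
# already contains the flawed step.
def toy_rollout(prefix, rng):
    return "bad step" not in prefix

steps = ["set up equation", "simplify", "bad step", "conclude"]
scores = estimate_step_correctness(steps, toy_rollout)
print(scores, first_error_step(scores))
```

Labels of this kind can then serve as automatically annotated supervision for the reinforcement learning stage, in place of human step-level annotations.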
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Abstract
arXiv:2505.00703v1 Announce Type: cross Abstract: Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
摘要
大型语言模型的最新进展揭示了思维链(CoT)与强化学习(RL)结合对性能的提升作用。然而,此类推理策略在视觉生成领域的应用仍属空白。本文提出T2I-R1——一种基于双层次CoT推理与强化学习的新型文本生成图像模型。具体而言,我们识别出可优化不同生成阶段的两层CoT:(1)用于提示词高层规划的语义级CoT;(2)在逐块生成过程中处理底层像素的令牌级CoT。为协调这两层CoT,我们引入集成生成奖励的BiCoT-GRPO算法,实现在同一训练步骤中同步优化双重生成推理链。将本推理策略应用于基线模型Janus-Pro后,在T2I-CompBench和WISE基准上分别取得13%和19%的性能提升,甚至超越当前最先进的FLUX.1模型。代码已开源:https://github.com/CaraJ7/T2I-R1
Artificial Scientific Discovery
Abstract
arXiv:2411.11672v2 Announce Type: replace Abstract: Rooted in the explosion of deep learning over the past decade, this thesis spans from AlphaGo to ChatGPT to empirically examine the fundamental concepts needed to realize the vision of an artificial scientist: a machine with the capacity to autonomously generate original research and contribute to the expansion of human knowledge. The investigation begins with Olivaw, an AlphaGo Zero-like agent that discovers Othello knowledge from scratch but is unable to communicate it. This realization leads to the development of the Explanatory Learning (EL) framework, a formalization of the problem faced by a scientist when trying to explain a new phenomenon to their peers. The effective EL prescriptions allow us to crack Zendo, a popular board game simulating the scientific endeavor. This success comes with a fundamental insight: an artificial scientist must develop its own interpretation of the language used to explain its findings, and not rely on a rigid existing interpreter. Questioning the very process of learning an interpreter, we turn our attention to the inner functioning of modern multimodal models. This culminates in a simple idea to build CLIP-like models where interpretation and perception are explicitly disentangled: a cost-effective approach that couples two unimodal models using little multimodal data and no further training. Finally, we discuss what ChatGPT and its siblings are still missing to become artificial scientists, and introduce the Big-Bench Symbol Interpretation Task, a benchmark about interpreting Zendo-like explanations that sees LLMs going no further than random chance while being instead fully solved by humans.
摘要
基于过去十年深度学习的爆发式发展,本论文从AlphaGo到ChatGPT展开实证研究,探讨实现'人工科学家'愿景所需的基础概念——即一种能够自主生成原创研究并推动人类知识边界拓展的机器系统。研究始于Olivaw(一个类AlphaGo Zero的智能体),该智能体虽能从零开始发现奥赛罗棋知识,却无法进行知识传递。这一发现促使我们建立了'解释性学习'(EL)框架,用以形式化科学家向同行解释新现象时面临的核心问题。通过有效的EL方法,我们成功破解了模拟科研过程的经典棋盘游戏Zendo。这一突破带来关键启示:人工科学家必须自主构建其发现成果的解释语言体系,而非依赖固化的现有解释器。为探究解释器的学习机制本质,我们将研究焦点转向现代多模态模型的内部运作原理,最终提出一种构建类CLIP模型的简洁方案——通过极少量多模态数据且无需额外训练,将两个单模态模型高效耦合,实现解释与感知的显式解耦。最后,我们分析了ChatGPT等系统成为人工科学家尚存的缺陷,并推出Big-Bench符号解释任务基准测试:该测试要求模型解释类Zendo规则说明,实验显示大型语言模型的性能仅达随机水平,而人类却能完全解决。
Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers
Abstract
arXiv:2501.16961v2 Announce Type: replace Abstract: Robustness of reasoning remains a significant challenge for large language models, and addressing it is essential for the practical applicability of AI-driven reasoning systems. We introduce Semantic Self-Verification (SSV), a novel approach that addresses the key challenge in combining language models with the rigor of logical solvers: to accurately formulate the reasoning problem from natural language to the formal language of the solver. SSV uses a consistency-based approach to produce strong abstract formalizations of problems using concrete instantiations that are generated by the model and verified by the solver. In addition to significantly advancing the overall reasoning accuracy over the state-of-the-art, a key novelty that this approach presents is a feature of verification that has near-perfect precision over a significant coverage of cases, as we demonstrate on open reasoning benchmarks. We propose such near-certain reasoning as a new approach to reduce the need for manual verification in many cases, taking us closer to more dependable and autonomous AI reasoning systems.
摘要
推理的鲁棒性仍是大型语言模型面临的重大挑战,解决该问题对AI驱动推理系统的实际应用至关重要。我们提出语义自验证(SSV)这一创新方法,旨在解决语言模型与逻辑求解器严谨性结合的核心难题:如何将自然语言表述的推理问题准确转化为求解器的形式化语言。SSV采用基于一致性的方法,通过模型生成并经求解器验证的具体实例,产生问题的强抽象形式化表达。该方法不仅显著提升了当前最先进技术的整体推理准确率,其关键创新在于引入了一种验证机制——如我们在开放推理基准测试中所展示的——该机制在覆盖大量案例的同时保持近乎完美的精确度。我们提出这种近确定性推理作为新范式,可在多数情况下减少人工验证需求,从而推动构建更可靠、更自主的AI推理系统。
Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search
Abstract
arXiv:2504.19636v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated significant potential in algorithm design. However, when integrated into search frameworks for iterative algorithm search, the underlying fitness landscape -- critical for understanding search behaviour -- remains underexplored. In this paper, we illustrate and analyze the fitness landscape of LLM-assisted Algorithm Search (LAS) using a graph-based approach, where nodes represent algorithms and edges denote transitions between them. We conduct extensive evaluations across six algorithm design tasks and six commonly used LLMs. Our findings reveal that LAS landscapes are highly multimodal and rugged, particularly in combinatorial optimization tasks, with distinct structural variations across tasks and LLMs. For instance, heuristic design tasks exhibit dense clusters of high-performing algorithms, while symbolic regression tasks show sparse, scattered distributions. Additionally, we demonstrate how population size influences exploration-exploitation trade-offs and the evolving trajectory of elite algorithms. These insights not only advance our understanding of LAS landscapes but also provide practical guidance for designing more effective LAS methods.
摘要
大语言模型(LLMs)在算法设计领域展现出显著潜力。然而,当将其集成至迭代式算法搜索的搜索框架时,作为理解搜索行为关键基础的目标适应度景观仍缺乏深入探究。本文采用基于图论的方法对LLM辅助算法搜索(LAS)的适应度景观进行可视化与分析,其中节点代表算法,边表示算法间的转换关系。我们在六类算法设计任务和六种常用LLM上展开全面评估,发现LAS景观具有高度多模态性和崎岖性(尤其在组合优化任务中),且不同任务与LLM间存在显著结构差异。例如启发式设计任务呈现高性能算法的密集聚类,而符号回归任务则表现出稀疏分散的分布特征。此外,我们论证了种群规模如何影响探索-开发权衡机制以及精英算法的演化轨迹。这些发现不仅深化了对LAS景观的理论认知,更为设计高效LAS方法提供了实践指导。
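A graph-based view of a fitness landscape, with algorithms as nodes and transitions as edges, makes multimodality easy to probe: a landscape is multimodal when it has more than one local optimum. The sketch below is an illustration under assumptions; it treats transitions as undirected and uses made-up node names and fitness values, neither of which comes from the paper.

```python
def local_optima(fitness: dict[str, float],
                 edges: list[tuple[str, str]]) -> list[str]:
    """Return nodes whose fitness is at least that of every neighbour.

    Edges are treated as undirected here, which is a simplifying
    assumption for this sketch.
    """
    neighbours = {node: set() for node in fitness}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(
        node for node in fitness
        if all(fitness[node] >= fitness[other] for other in neighbours[node])
    )

# Hypothetical landscape: four algorithms connected in a chain.
fitness = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.8}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(local_optima(fitness, edges))
```

Here both "B" and "D" are local optima, so a search that only accepts improving transitions could stall at "D" without ever reaching the better algorithm "B".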
EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers
Abstract
arXiv:2309.08532v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
摘要
大型语言模型(LLMs)在各类任务中表现卓越,但其性能依赖于需耗费大量人工精心设计的提示词。为实现该过程的自动化,本文提出一种名为EvoPrompt的离散提示词优化新框架,其借鉴了具有优异性能与快速收敛特性的进化算法(EAs)思想。为使进化算法能够处理需保持连贯性与人类可读性的自然语言离散提示词,我们将LLMs与EAs相结合。这种方法使我们能同时利用LLMs强大的语言处理能力和EAs的高效优化性能。具体而言,EvoPrompt无需任何梯度或参数,从初始提示词种群出发,基于进化算子通过LLMs迭代生成新提示词,并根据开发集持续改进种群。我们在包含语言理解、生成任务及BIG-Bench Hard(BBH)任务的31个数据集上,针对GPT-3.5和Alpaca等闭源与开源LLMs进行了提示词优化。实验表明EvoPrompt显著优于人工设计的提示词和现有自动提示生成方法(例如在BBH任务上最高提升25%)。此外,本研究证实LLMs与EAs的结合能产生协同效应,这将为LLMs与传统算法的进一步融合研究提供启示。
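The loop described in the EvoPrompt abstract (start from a population of prompts, let an LLM apply evolutionary operators, keep what scores best on a development set) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_combine` and `toy_score` are hypothetical stand-ins for the LLM-driven operator and the dev-set evaluation, and the elitist "keep the top N" selection is an assumption.

```python
import random
from typing import Callable, List


def evolve_prompts(
    population: List[str],
    llm_combine: Callable[[str, str], str],
    score: Callable[[str], float],
    generations: int = 10,
    seed: int = 0,
) -> List[str]:
    """Elitist loop: breed one child per step, keep the best len(population) prompts."""
    rng = random.Random(seed)
    pop = sorted(population, key=score, reverse=True)
    size = len(pop)
    for _ in range(generations):
        parent_a, parent_b = rng.sample(pop, 2)  # two distinct parents
        child = llm_combine(parent_a, parent_b)  # an LLM would play this role
        pop = sorted(pop + [child], key=score, reverse=True)[:size]
    return pop


# Hypothetical stand-ins so the sketch runs without a model or a dataset.
def toy_combine(a: str, b: str) -> str:
    """Fake crossover: first half of one prompt plus second half of the other."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


def toy_score(prompt: str) -> float:
    """Fake dev-set score: counts how many 'useful' keywords the prompt contains."""
    keywords = {"step", "carefully", "answer"}
    return float(len(keywords & set(prompt.lower().split())))


seed_prompts = [
    "Think step by step",
    "Read carefully then answer",
    "Give the final answer",
]
final = evolve_prompts(seed_prompts, toy_combine, toy_score)
print(final[0], toy_score(final[0]))
```

Because selection only ever keeps the highest-scoring prompts, the best score in the population cannot decrease from one generation to the next; in a real setting `score` would be accuracy on held-out development examples.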
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Abstract
arXiv:2402.07033v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) with the Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks. However, due to the huge model sizes, running them in resource-constrained environments where the GPU memory is not abundant is challenging. Some existing systems propose to use CPU resources to solve that, but they either suffer from the significant overhead of frequently moving data between CPU and GPU, or fail to consider distinct characteristics of CPUs and GPUs. This paper proposes Fiddler, a resource-efficient inference system for MoE models with limited GPU resources. Fiddler strategically utilizes CPU and GPU resources by determining the optimal execution strategy. Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single batch inference or long prefill, Fiddler performs better in all scenarios. Compared against different baselines, Fiddler achieves 1.26 times speed up in single batch inference, 1.30 times in long prefill processing, and 11.57 times in beam search inference. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.
摘要
采用混合专家架构(Mixture-of-Experts, MoE)的大型语言模型(LLMs)在各种任务中展现出优异性能。然而,由于模型规模庞大,在GPU内存有限的资源受限环境中运行这些模型具有挑战性。现有系统尝试利用CPU资源解决该问题,但存在CPU与GPU间频繁数据迁移导致的显著开销,或未能充分考虑CPU与GPU的特性差异。本文提出Fiddler——一种面向GPU资源受限环境下MoE模型的高效推理系统。Fiddler通过确定最优执行策略,实现CPU与GPU资源的协同利用。评估结果表明:与专为单批次推理或长前缀处理等特定场景优化的现有系统不同,Fiddler在所有场景下均表现更优。相较于不同基线系统,Fiddler在单批次推理中实现1.26倍加速,长前缀处理中达1.30倍,束搜索推理中提升11.57倍。Fiddler代码已开源:https://github.com/efeslab/fiddler。
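One way to picture what "determining the optimal execution strategy" could mean for a single expert layer is a latency comparison between running the expert on the CPU, where its weights already live, and copying the weights to the GPU first. The cost model below is an assumption made for illustration and is not taken from the Fiddler abstract; all parameter names and numbers are hypothetical.

```python
def choose_device(
    n_tokens: int,
    weight_bytes: float,
    pcie_bytes_per_s: float,
    cpu_s_per_token: float,
    gpu_s_per_token: float,
    weights_on_gpu: bool,
) -> str:
    """Return 'gpu' or 'cpu', whichever has the lower estimated latency."""
    # Copying weights is only needed when the expert is not already resident.
    transfer_s = 0.0 if weights_on_gpu else weight_bytes / pcie_bytes_per_s
    gpu_cost = transfer_s + n_tokens * gpu_s_per_token
    cpu_cost = n_tokens * cpu_s_per_token
    return "gpu" if gpu_cost <= cpu_cost else "cpu"


# Hypothetical numbers: a 300 MB expert, 30 GB/s link, CPU 100x slower per token.
params = dict(
    weight_bytes=300e6,
    pcie_bytes_per_s=30e9,
    cpu_s_per_token=1e-3,
    gpu_s_per_token=1e-5,
)
print(choose_device(1, weights_on_gpu=False, **params))   # few tokens
print(choose_device(64, weights_on_gpu=False, **params))  # many tokens
```

Under this toy model, a single token is cheaper to process on the CPU because the weight copy dominates, while a long prefill amortizes the copy and favours the GPU. That matches the abstract's point that CPUs and GPUs have distinct characteristics that a scheduler should account for.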
LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem
Abstract
arXiv:2403.00108v2 Announce Type: replace-cross Abstract: Finetuning LLMs with LoRA has gained significant popularity due to its simplicity and effectiveness. Often, users may even find pluggable, community-shared LoRAs to enhance their base models for a specific downstream task of interest, enjoying a powerful, efficient, yet customized LLM experience with negligible investment. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can distribute malicious LoRAs to a community eager to try out shared assets. Despite the high-risk potential, no prior art has comprehensively explored LoRA's attack surface under the downstream-enhancing share-and-play context. In this paper, we investigate how backdoors can be injected into task-enhancing LoRAs and examine the mechanisms of such infections. We find that with a simple, efficient, yet specific recipe, a backdoor LoRA can be trained once and then seamlessly merged (in a training-free fashion) with multiple task-enhancing LoRAs, retaining both its malicious backdoor and benign downstream capabilities. This allows attackers to scale the distribution of compromised LoRAs with minimal effort by leveraging the rich pool of existing shared LoRA assets. We note that such merged LoRAs are particularly infectious -- because their malicious intent is cleverly concealed behind improved downstream capabilities, creating a strong incentive for voluntary download -- and dangerous -- because under local deployment, no safety measures exist to intervene when things go wrong. Our work is among the first to study this new threat model of training-free distribution of downstream-capable-yet-backdoor-injected LoRAs, highlighting the urgent need for heightened security awareness in the LoRA ecosystem. Warning: This paper contains offensive content and involves a real-life tragedy.
摘要
摘要:基于LoRA的LLMs微调因其简便高效而广受欢迎。用户甚至可以直接插入社区共享的LoRA模块,以增强基础模型在特定下游任务中的表现,从而以近乎零成本获得强大、高效且定制化的LLM体验。然而这种便捷的共享使用生态也催生了新的攻击面——攻击者可向热衷尝试共享资源的社区分发恶意LoRA模块。尽管风险极高,现有研究尚未系统探讨下游增强场景下LoRA的攻击面。本文研究了如何在后门注入任务增强型LoRA模块,并剖析其感染机制。我们发现通过一种简单高效的特殊方法,训练一次的后门LoRA即可无缝(以免训练方式)与多个任务增强型LoRA合并,同时保留恶意后门与良性下游能力。这使得攻击者能借助现有共享LoRA资源库,以最小成本规模化分发被篡改的模块。值得注意的是,此类合并后的LoRA具有极强传染性——因其恶意意图被巧妙隐藏在提升的下游能力背后,形成强烈的自愿下载诱因;同时极度危险——本地部署环境下,当问题发生时缺乏任何安全干预机制。本研究首次系统探讨了这种免训练分发兼具下游能力与后门注入的LoRA新威胁模型,揭示了提升LoRA生态系统安全意识的紧迫性。警告:本文包含攻击性内容并涉及真实悲剧事件。
Large Language Model Agent as a Mechanical Designer
Abstract
arXiv:2404.17525v3 Announce Type: replace-cross Abstract: Conventional mechanical design follows an iterative process in which initial concepts are refined through cycles of expert assessment and resource-intensive Finite Element Method (FEM) analysis to meet performance goals. While machine learning models have been developed to assist in parts of this process, they typically require large datasets, extensive training, and are often tailored to specific tasks, limiting their generalizability. To address these limitations, we propose a framework that leverages a pretrained Large Language Model (LLM) in conjunction with an FEM module to autonomously generate, evaluate, and refine structural designs based on performance specifications and numerical feedback. The LLM operates without domain-specific fine-tuning, using general reasoning to propose design candidates, interpret FEM-derived performance metrics, and apply structurally sound modifications. Using 2D truss structures as a testbed, we show that the LLM can effectively navigate highly discrete and multi-faceted design spaces, balance competing objectives, and identify convergence when further optimization yields diminishing returns. Compared to Non-dominated Sorting Genetic Algorithm II (NSGA-II), our method achieves faster convergence and fewer FEM evaluations. Experiments with varying temperature settings (0.5, 1.0, 1.2) and model sizes (GPT-4.1 and GPT-4.1-mini) indicate that smaller models yield higher constraint satisfaction with fewer steps, while lower temperatures enhance design consistency. These results establish LLMs as a promising new class of reasoning-based, natural language-driven optimizers for autonomous design and iterative structural refinement.
摘要
传统机械设计遵循迭代流程,初始概念需通过专家评估与资源密集的有限元分析(FEM)循环改进以满足性能目标。尽管已有机器学习模型辅助部分流程,但这些模型通常需要大量数据集、长时间训练,且多针对特定任务定制,泛化能力有限。为解决这些局限,我们提出一个框架,利用预训练大语言模型(LLM)结合FEM模块,基于性能指标与数值反馈自主生成、评估与优化结构设计。该LLM无需领域微调,通过通用推理提出设计方案、解析FEM性能指标并应用结构合理的修改。以二维桁架结构为测试平台,我们证明LLM能有效探索高度离散、多维的设计空间,平衡竞争目标,并在优化收益递减时识别收敛点。相比非支配排序遗传算法II(NSGA-II),本方法收敛更快且FEM评估次数更少。不同温度参数(0.5、1.0、1.2)与模型规模(GPT-4.1和GPT-4.1-mini)实验表明:较小模型能以更少步骤实现更高约束满足率,较低温度则提升设计一致性。这些结果确立了LLM作为新型基于推理、自然语言驱动的自主设计与迭代结构优化工具的潜力。
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Abstract
arXiv:2405.04532v3 Announce Type: replace-cross Abstract: Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization that can allow low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen1.5-72B by 2.4x on A100, 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3x. Code is available at https://github.com/mit-han-lab/omniserve.
摘要
量化技术能够加速大语言模型(LLM)的推理过程。在超越INT8量化之后,研究界正积极探索更低精度的方案,例如INT4。然而,最先进的INT4量化技术仅能加速小批量、边缘端的LLM推理,无法在大批量、云端的LLM服务中带来性能提升。我们发现一个关键问题:现有INT4量化方法在GPU上对权重或部分和进行反量化时会产生显著运行时开销(20-90%)。为解决这一挑战,我们提出QoQ算法——一种采用4比特权重(W4)、8比特激活(A8)和4比特KV缓存(KV4)的量化方案。QoQ得名于拉丁语"quattuor-octo-quattuor",代表4-8-4的数值组合。该算法由QServe推理库实现并实测获得加速效果。
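To make the "4-bit weight" part of W4A8KV4 concrete, here is a generic asymmetric 4-bit quantize/dequantize round trip for one group of weights, with two codes packed per byte. It is a textbook-style sketch and not QoQ's progressive quantization, whose details the abstract does not give; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 15] using a shared scale and zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_4bit(codes: List[int], scale: float, zero: float) -> List[float]:
    """Invert quantize_4bit up to rounding error (at most scale / 2 per value)."""
    return [c * scale + zero for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, low nibble first; pads odd lengths with 0."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
codes, scale, zero = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, zero)
print(codes, [round(x, 3) for x in restored])
```

The dequantization step (`c * scale + zero`) is the per-element work that, according to the abstract, costs existing INT4 methods 20-90% runtime overhead on GPUs, which is why QServe focuses on making it cheap.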
Folded Context Condensation in Path Integral Formalism for Infinite Context Transformers
Abstract
arXiv:2405.04620v5 Announce Type: replace-cross Abstract: In this work, we present a generalized formulation of the Transformer algorithm by reinterpreting its core mechanisms within the framework of Path Integral formalism. In this perspective, the attention mechanism is recast as a process that integrates all possible transition paths leading to future token states, with temporal evolution governed by the Feed-Forward Network. By systematically mapping each component of the Transformer to its counterpart in the Path Integral formulation, we obtain a more compact and efficient representation, in which the contextual information of a sequence is condensed into memory-like segments. These segments are recurrently processed across Transformer layers, enabling more effective long-term information retention. We validate the effectiveness of this approach through the Passkey retrieval task and a summarization task, demonstrating that the proposed method preserves historical information while exhibiting memory usage that scales linearly with sequence length. This contrasts with the non-linear memory growth typically observed in standard attention mechanisms. We expect that this quantum-inspired generalization of the Transformer architecture will open new avenues for enhancing both the efficiency and expressiveness of future Transformer models.
摘要
在本研究中,我们通过将Transformer算法的核心机制重新诠释为路径积分形式框架,提出了一种广义化的Transformer算法表述。该视角下,注意力机制被重构为对所有可能通向未来标记状态的转移路径进行积分的过程,其时间演化由前馈网络控制。通过系统地将Transformer各组件映射到路径积分表述中的对应部分,我们获得了一种更紧凑高效的表示形式——其中序列的上下文信息被压缩为类记忆片段。这些片段在Transformer各层间进行递归处理,从而实现更有效的长程信息保留。我们通过密钥检索任务和文本摘要任务验证了该方法的有效性,结果表明所提方法在保持历史信息的同时,其内存使用量与序列长度呈线性增长关系。这与标准注意力机制中常见的非线性内存增长形成鲜明对比。我们预期这种受量子力学启发的Transformer架构广义化,将为提升未来Transformer模型的效率和表达能力开辟新途径。
Automated Review Generation Method Based on Large Language Models
Abstract
arXiv:2407.20906v5 Announce Type: replace-cross Abstract: Literature research, vital for scientific work, faces the challenge of surging information volumes exceeding researchers' processing capabilities. We present an automated review generation method based on large language models (LLMs) to overcome efficiency bottlenecks and reduce cognitive load. Our statistically validated evaluation framework demonstrates that the generated reviews match or exceed manual quality, offering broad applicability across research fields without requiring users' domain knowledge. Applied to propane dehydrogenation (PDH) catalysts, our method swiftly analyzed 343 articles, averaging seconds per article per LLM account, producing comprehensive reviews spanning 35 topics, with extended analysis of 1041 articles providing insights into catalysts' properties. Through multi-layered quality control, we effectively mitigated LLMs' hallucinations, with expert verification confirming accuracy and citation integrity while demonstrating hallucination risks reduced to below 0.5% with 95% confidence. The released Windows application enables one-click review generation, enhancing research productivity and literature recommendation efficiency while setting the stage for broader scientific explorations.
摘要
文献研究作为科研工作的核心环节,正面临信息量激增超越研究者处理能力的挑战。本研究提出基于大语言模型(LLMs)的自动化综述生成方法,旨在突破效率瓶颈并降低认知负荷。通过统计学验证的评估框架表明,生成综述的质量达到或超越人工水平,且无需用户具备领域知识即可跨学科广泛应用。将该方法应用于丙烷脱氢(PDH)催化剂领域,系统在平均单账号每秒处理速度下快速分析343篇文献,生成涵盖35个主题的综合性综述,并对1041篇文献的扩展分析揭示了催化剂特性规律。通过多层质量控制机制有效抑制LLMs的幻觉现象,专家验证确认内容准确性及引证完整性,在95%置信度下将幻觉风险控制在0.5%以内。发布的Windows应用程序支持一键生成综述,在提升科研生产力和文献推荐效率的同时,为更广泛的科学探索奠定了基础。
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Abstract
arXiv:2410.08067v4 Announce Type: replace-cross Abstract: Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to optimal responses that are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. The experiments across various benchmarks and diverse models demonstrate that our approach consistently boosts DPO by a considerable margin. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion. Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.
摘要
大型语言模型(LLMs)的偏好对齐显著提升了其遵循人类指令与意图的能力。然而,现有直接对齐算法主要关注相对偏好,尽管在AI反馈过程中可获得包含评判模型奖励分数的偏好数据,却常忽略响应的质性特征。单纯追求选定响应与稍逊拒绝响应间隐含奖励差距的最大化,可能导致过拟合以及对高质量拒绝响应不必要的遗忘。对奖励分数的忽视还会驱使LLM盲目偏好低质量选定响应,难以泛化至数据稀疏的最优响应。为克服这些缺陷,本研究提出奖励条件化LLM策略,通过识别并学习数据集中响应质量的完整分布,助力模型外推至更优区域。我们设计了一种高效而简单的数据重标注方法,将偏好对与质量分数相绑定以构建奖励增强数据集。跨多种基准与异构模型的实验表明,该方法持续以显著优势提升DPO性能。通过系统消融研究,我们证明该方法不仅能最大化偏好数据的效用,还可缓解遗忘问题,其广泛有效性超越了单纯的数据扩展。代码已开源:https://github.com/shenao-zhang/reward-augmented-preference。
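The data relabeling idea in this abstract (conditioning preference pairs on quality scores) might look like the sketch below. The exact scheme is an assumption: each original pair is turned into two pairs, one conditioned on the chosen response's score and one on the rejected response's score, so the lower-quality response becomes the preferred one when its own score is the target. The `Pair` type and the prompt suffix format are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pair:
    prompt: str
    chosen: str
    rejected: str


def relabel(
    prompt: str, chosen: str, rejected: str, r_chosen: float, r_rejected: float
) -> List[Pair]:
    """Build reward-conditioned pairs from one scored preference pair."""

    def conditioned(goal: float) -> str:
        # Hypothetical way to expose the target quality to the policy.
        return f"{prompt}\n[target quality: {goal}]"

    return [
        # Asked for the higher score, the originally chosen response wins.
        Pair(conditioned(r_chosen), chosen, rejected),
        # Asked for the lower score, the originally rejected response wins,
        # so its content is not simply pushed away during training.
        Pair(conditioned(r_rejected), rejected, chosen),
    ]


augmented = relabel("Explain DPO.", "Detailed answer", "Good answer", 9.0, 8.0)
for pair in augmented:
    print(repr(pair.prompt), "->", pair.chosen)
```

The resulting dataset can be fed to an unchanged DPO trainer; at inference time one would condition on a high target score. This is how such a scheme could address the abstract's concern about unlearning high-quality rejected responses.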
Generating Traffic Scenarios via In-Context Learning to Learn Better Motion Planner
Abstract
arXiv:2412.18086v2 Announce Type: replace-cross Abstract: Motion planning is a crucial component in autonomous driving. State-of-the-art motion planners are trained on meticulously curated datasets, which are not only expensive to annotate but also insufficient in capturing rarely seen critical scenarios. Failing to account for such scenarios poses a significant risk to motion planners and may lead to incidents during testing. An intuitive solution is to manually compose such scenarios by programming and executing a simulator (e.g., CARLA). However, this approach incurs substantial human costs. Motivated by this, we propose an inexpensive method for generating diverse critical traffic scenarios to train more robust motion planners. First, we represent traffic scenarios as scripts, which are then used by the simulator to generate traffic scenarios. Next, we develop a method that accepts user-specified text descriptions, which a Large Language Model translates into scripts using in-context learning. The output scripts are sent to the simulator that produces the corresponding traffic scenarios. As our method can generate abundant safety-critical traffic scenarios, we use them as synthetic training data for motion planners. To demonstrate the value of generated scenarios, we train existing motion planners on our synthetic data, real-world datasets, and a combination of both. Our experiments show that motion planners trained with our data significantly outperform those trained solely on real-world data, showing the usefulness of our synthetic data and the effectiveness of our data generation method. Our source code is available at https://ezharjan.github.io/AutoSceneGen.
摘要
运动规划是自动驾驶中的关键组件。当前最先进的运动规划器基于精心标注的数据集进行训练,这些数据集不仅标注成本高昂,而且难以涵盖罕见的关键场景。忽视此类场景会给运动规划器带来重大风险,并可能导致测试阶段发生事故。一种直观解决方案是通过编程运行模拟器(如CARLA)手动构建此类场景,但这种方法需要耗费大量人力。为此,我们提出一种低成本方法用于生成多样化关键交通场景,以训练更具鲁棒性的运动规划器。首先,我们将交通场景表示为脚本,由模拟器根据脚本生成交通场景。接着,我们开发的方法可接收用户指定的文本描述,通过大型语言模型利用上下文学习将其转换为脚本。输出脚本被发送至模拟器以生成对应交通场景。由于本方法能生成大量安全关键交通场景,我们将其作为运动规划器的合成训练数据。为验证生成场景的价值,我们在合成数据、真实数据集及二者组合上分别训练现有运动规划器。实验表明,使用本方法数据训练的运动规划器显著优于仅使用真实数据训练的版本,这既证明了合成数据的实用性,也验证了我们数据生成方法的有效性。
A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods
Abstract
arXiv:2501.13947v3 Announce Type: replace-cross Abstract: The rapid development of artificial intelligence has led to marked progress in the field. One interesting direction for research is whether Large Language Models (LLMs) can be integrated with structured knowledge-based systems. This approach aims to combine the generative language understanding of LLMs with the precise knowledge representation of the systems they are integrated with. This article surveys the relationship between LLMs and knowledge bases, looks at how they can be applied in practice, and discusses related technical, operational, and ethical challenges. Drawing on a comprehensive examination of the literature, the study both identifies important issues and assesses existing solutions. It demonstrates the merits of incorporating generative AI into structured knowledge-base systems with respect to data contextualization, model accuracy, and utilization of knowledge resources. The findings give a full overview of the current state of research, point out the main gaps, and propose helpful paths forward. These insights contribute to advancing AI technologies and support their practical deployment across various sectors.
摘要
人工智能的快速发展推动了该领域的显著进步。一个值得关注的研究方向是大型语言模型(LLMs)能否与结构化知识系统相融合。该方法旨在将LLMs的生成式语言理解能力与精确的知识表征系统相结合。本文系统考察了LLMs与知识库之间的关联,探讨了其实践应用路径,并分析了相关的技术、操作及伦理挑战。通过文献综述,本研究既识别出关键问题,又评估了现有解决方案。研究论证了将生成式人工智能整合到结构化知识库系统在数据情境化、模型准确性及知识资源利用方面的优势。研究结果全面梳理了当前研究现状,指出主要空白领域,并提出了可行的研究方向。这些见解有助于推动人工智能技术发展,并支持其在各领域的实际应用部署。
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Abstract
arXiv:2502.05945v2 Announce Type: replace-cross Abstract: Robust alignment guardrails for large language models are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model subcomponents, particularly attention heads, using a simple binary choice probing strategy. These interventions then generalise to the open-ended generation setting, effectively circumventing safety guardrails. We show that probing single attention heads is more effective than intervening on full layers and that intervening on only four attention heads is comparable to supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. Our findings highlight the shortcomings of current alignment techniques. In addition, our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviors. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety requiring fine-grained control over the model output. The code and datasets for this study can be found on https://github.com/PaulDrm/targeted_intervention.
摘要
随着大型语言模型的广泛应用,其鲁棒对齐防护机制的重要性日益凸显。与以往研究不同,我们证明推理阶段的激活干预能够绕过Llama 2的安全对齐机制,有效引导模型生成有害的AI协同行为。本方法采用简单的二元选择探测策略,在特定模型子组件(尤其是注意力头)实施细粒度干预。这些干预能泛化至开放式生成场景,成功规避安全防护机制。研究表明:对单个注意力头进行探测比全层干预更有效,仅干预四个注意力头即可达到与监督微调相当的效果;计算有效引导方向仅需少量示例补全,较传统微调更具优势。这些发现揭示了现有对齐技术的缺陷,同时表明在注意力头层面,激活编码了线性可分离的细粒度行为特征。该方法为引导大语言模型行为提供了简洁方案,可扩展至安全领域外需要精细控制模型输出的多样化场景。本研究代码与数据集详见https://github.com/PaulDrm/targeted_intervention。
Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?
Abstract
arXiv:2502.07963v2 Announce Type: replace-cross Abstract: Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.
摘要
医学研究在将新型疗法转化为临床实践方面面临诸多公认挑战。发表机制激励研究者呈现"阳性"结果,即便实证数据模棱两可。大量文献表明,作者常对研究结果进行选择性报道(spin),尤见于论文摘要。这种倾向性表述可能影响临床医生对证据的解读,进而干扰诊疗决策。本研究探讨大型语言模型(LLMs)对临床试验结果的解读是否同样受倾向性表述影响,这一问题至关重要,因为LLMs正日益用于医学证据的检索与整合。我们对22个LLMs进行评估,发现其普遍比人类更易受倾向性表述影响。模型还可能将倾向性传递至输出内容:例如有证据表明,LLMs会将其隐性融入生成的通俗摘要中。但研究同时发现,LLMs基本具备识别倾向性表述的能力,通过特定提示可减轻其对模型输出的影响。
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Abstract
arXiv:2502.18137v2 Announce Type: replace-cross Abstract: An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
摘要
高效的注意力机制实现对于大型模型至关重要,因其具有二次时间复杂度。幸运的是,注意力通常表现出稀疏性,即注意力图中许多值接近于零,从而可以省略相应计算。已有许多研究利用这种稀疏模式加速注意力计算。然而,现有工作大多通过利用注意力图的特定稀疏模式来优化特定模型内的注意力计算。一种既能保证加速效果又能保持多样化模型端到端性能的通用稀疏注意力机制仍未被实现。本文提出SpargeAttn,一种适用于任何模型的通用稀疏量化注意力机制。我们的方法采用两级在线过滤器:第一阶段快速准确地预测注意力图,从而跳过注意力计算中的部分矩阵乘法;第二阶段设计了一种无额外开销的在线softmax感知过滤器,进一步跳过部分矩阵乘法。实验表明,该方法在不牺牲端到端指标的前提下,显著加速了包括语言、图像和视频生成在内的多种模型。代码已开源:https://github.com/thu-ml/SpargeAttn。
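The first filtering stage described in the SpargeAttn abstract (cheaply predict the attention map, then skip block matrix multiplications that would contribute almost nothing) can be illustrated with a block-level mask. The prediction rule used here, mean-pooling each query and key block and keeping the key blocks that cover a fixed share of softmax mass, is an assumption for illustration and is not taken from the abstract. The second, softmax-aware stage is not shown.

```python
import math
from typing import List

Block = List[List[float]]  # a block is a list of token vectors


def mean_vec(rows: Block) -> List[float]:
    """Average the token vectors of one block into a single representative."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict_block_mask(
    q_blocks: List[Block], k_blocks: List[Block], keep_mass: float = 0.9
) -> List[List[bool]]:
    """mask[i][j] is True when the (query block i, key block j) product is kept."""
    k_means = [mean_vec(kb) for kb in k_blocks]
    mask = []
    for qb in q_blocks:
        q = mean_vec(qb)
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_means]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        keep, mass = set(), 0.0
        for j in order:  # greedily keep the heaviest blocks first
            keep.add(j)
            mass += probs[j]
            if mass >= keep_mass:
                break
        mask.append([j in keep for j in range(len(k_blocks))])
    return mask


q_blocks = [[[10.0, 0.0], [10.0, 0.0]]]
k_blocks = [[[1.0, 0.0]], [[-1.0, 0.0]], [[0.0, 1.0]]]
block_mask = predict_block_mask(q_blocks, k_blocks)
print(block_mask)
```

Only blocks marked `True` would be passed to the full attention kernel; the rest of the QK and PV products are skipped, which is where the speedup would come from.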
UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Abstract
arXiv:2502.20984v3 Announce Type: replace-cross Abstract: SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
摘要
SemEval-2025任务1的核心是根据图像与给定名词性复合词的匹配程度进行排序,这些复合词在英语和巴西葡萄牙语中可能具有习语含义。为应对这一挑战,本研究采用生成式大型语言模型(LLMs)和多语言CLIP模型来增强习语复合词的表示。LLMs为潜在具有习语含义的复合词生成习语解释,从而丰富其语义理解。随后,这些解释通过多语言CLIP模型进行编码,作为图像排序的表示。研究应用对比学习和数据增强技术对这些嵌入进行微调以提升性能。实验结果表明,通过此方法提取的多模态表示优于仅基于原始名词性复合词的表示。微调方法显示出有希望的结果,但效果不及直接使用未微调的嵌入。本文使用的源代码可在https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL获取。
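The ranking step in this system description (encode the LLM-generated meaning, then order candidate images by how well they align with it) reduces to sorting by embedding similarity. The sketch below assumes the text and image embeddings have already been produced by some CLIP-style encoder and uses cosine similarity as the alignment score. The toy vectors and image names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_emb: List[float], image_embs: Dict[str, List[float]]) -> List[str]:
    """Return image names ordered from most to least similar to the text."""
    return sorted(
        image_embs,
        key=lambda name: cosine(text_emb, image_embs[name]),
        reverse=True,
    )


# Toy 2-d embeddings standing in for encoder outputs.
meaning_emb = [1.0, 0.0]
candidates = {
    "literal.png": [0.0, 1.0],
    "idiomatic.png": [1.0, 0.0],
    "mixed.png": [1.0, 1.0],
}
ranking = rank_images(meaning_emb, candidates)
print(ranking)
```

Swapping `meaning_emb` between the embedding of the bare compound and the embedding of the generated idiomatic meaning is the comparison the abstract reports on.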
Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement
Abstract
arXiv:2503.23895v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces the challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding documents into LLM parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.
摘要
检索增强生成(RAG)通过从外部源检索相关文档并将其融入上下文,增强了大型语言模型(LLMs)的能力。虽然它通过提供事实性文本来提高可靠性,但随着上下文长度的增加,推理成本显著上升,并引发了RAG幻觉这一棘手问题,这主要源于LLMs缺乏相应的参数化知识。一种高效的解决方案是在测试时增强LLMs的知识。参数化RAG(PRAG)通过将文档嵌入LLMs参数以实现测试时知识增强,通过离线训练有效降低推理成本。然而,其高昂的训练和存储成本,以及有限的泛化能力,严重制约了其实际应用。为解决这些挑战,我们提出动态参数化RAG(DyPRAG),这是一个新颖的框架,利用轻量级参数翻译器模型高效地将文档转化为参数化知识。DyPRAG不仅降低了推理、训练和存储成本,还能动态生成参数化知识,以即插即用的方式无缝增强LLMs的知识并解决测试时的知识冲突。在多个数据集上的大量实验证明了DyPRAG的有效性和泛化能力,为现实应用提供了强大且实用的RAG范式,实现了卓越的知识融合并缓解了RAG幻觉问题。我们的代码可在https://github.com/Trae1ounG/DyPRAG获取。
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
Abstract
arXiv:2504.02546v3 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.
摘要
强化学习(RL)无需过度依赖监督微调(SFT)即可直接增强大语言模型的推理能力。本研究重新审视传统策略梯度(PG)机制,提出一种称为组策略梯度(GPG)的极简RL方法。与传统方法不同,GPG直接优化原始RL目标,从而无需替代损失函数。通过消除评论家模型和参考模型、避免KL散度约束,并解决优势与梯度估计偏差问题,我们的方法相较于组相对策略优化(GRPO)显著简化了训练流程。该方法在不依赖辅助技术或调整的情况下实现了更优性能。如图1所示,大量实验表明我们的方法不仅降低了计算成本,还在各类单模态与多模态任务中持续超越GRPO。代码已开源:https://github.com/AMAP-ML/GPG。
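A plain policy-gradient objective with a group-based baseline, and with no critic, reference model, or KL term, can be written in a few lines. The sketch below assumes the advantage is the reward minus the mean reward of the responses sampled for the same prompt. The abstract does not state GPG's exact normalization or its bias correction, so those are left out, and the function names are hypothetical.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Centre rewards on the group mean, so the group itself acts as the baseline."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def policy_gradient_loss(logprobs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style loss: minimizing it raises log-probs of above-average samples."""
    advantages = group_advantages(rewards)
    weighted = sum(a * lp for a, lp in zip(advantages, logprobs))
    return -weighted / len(rewards)


# Four sampled responses to one prompt: sequence log-probs and 0/1 correctness rewards.
sample_logprobs = [-1.0, -2.0, -2.0, -1.0]
sample_rewards = [1.0, 0.0, 0.0, 1.0]
loss = policy_gradient_loss(sample_logprobs, sample_rewards)
print(group_advantages(sample_rewards), loss)
```

In a real trainer the log-probabilities would be differentiable tensors from the policy model and the advantages would be treated as constants; no second model is needed to compute either quantity, which is the simplification the abstract emphasizes.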
ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model
Abstract
arXiv:2504.14560v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three complementary innovations: (1) ReasoningV-5K, a high-quality dataset of 5,000 functionally verified instances with reasoning paths created through multi-dimensional filtering of PyraNet samples; (2) a two-stage training approach combining parameter-efficient fine-tuning for foundational knowledge with full-parameter optimization for enhanced reasoning; and (3) an adaptive reasoning mechanism that dynamically adjusts reasoning depth based on problem complexity, reducing token consumption by up to 75% while preserving performance. Experimental results demonstrate ReasoningV's effectiveness with a pass@1 accuracy of 57.8% on VerilogEval-human, achieving performance competitive with leading commercial models like Gemini-2.0-flash (59.5%) and exceeding the previous best open-source model by 10.4 percentage points. ReasoningV offers a more reliable and accessible pathway for advancing AI-driven hardware design automation, with our model, data, and code available at https://github.com/BUAA-CLab/ReasoningV.
摘要
大型语言模型(LLMs)在Verilog代码生成领域取得显著进展,但仍面临数据质量、推理能力和计算效率方面的挑战。本文提出ReasoningV模型,该模型采用混合推理策略,将训练获得的内在能力与动态推理适应相结合以实现Verilog代码生成。我们的框架包含三项创新:(1) ReasoningV-5K数据集——通过对PyraNet样本进行多维过滤构建的5,000个功能验证实例及其推理路径的高质量数据集;(2) 两阶段训练方法——结合参数高效微调的基础知识学习与全参数优化的增强推理训练;(3) 自适应推理机制——根据问题复杂度动态调整推理深度,在保持性能的同时降低最高75%的token消耗。实验结果表明,ReasoningV在VerilogEval-human基准测试中以57.8%的pass@1准确率展现卓越性能,与Gemini-2.0-flash(59.5%)等领先商业模型相当,并超越此前最佳开源模型10.4个百分点。ReasoningV为推进AI驱动的硬件设计自动化提供了更可靠、更易获取的技术路径,相关模型、数据及代码已开源:https://github.com/BUAA-CLab/ReasoningV。
Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence
Abstract
arXiv:2504.14625v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have transformed code generation, yet their application in hardware design produces gate counts 38%--1075% higher than human designs. We present CircuitMind, a multi-agent framework that achieves human-competitive efficiency through three key innovations: syntax locking (constraining generation to basic logic gates), retrieval-augmented generation (enabling knowledge-driven design), and dual-reward optimization (balancing correctness with efficiency). To evaluate our approach, we introduce TC-Bench, the first gate-level benchmark harnessing collective intelligence from the TuringComplete ecosystem -- a competitive circuit design platform with hundreds of thousands of players. Experiments show CircuitMind enables 55.6% of model implementations to match or exceed top-tier human experts in composite efficiency metrics. Most remarkably, our framework elevates the 14B Phi-4 model to outperform both GPT-4o mini and Gemini 2.0 Flash, achieving efficiency comparable to the top 25% of human experts without requiring specialized training. These innovations establish a new paradigm for hardware optimization where collaborative AI systems leverage collective human expertise to achieve optimal circuit designs. Our model, data, and code are open-source at https://github.com/BUAA-CLab/CircuitMind.
摘要
大型语言模型(LLMs)已彻底改变代码生成领域,但其在硬件设计中的应用会导致门电路数量比人工设计高出38%至1075%。我们提出CircuitMind——一个通过三项关键创新实现人类水平效率的多智能体框架:语法锁定(将生成约束至基本逻辑门)、检索增强生成(实现知识驱动设计)和双奖励优化(平衡正确性与效率)。为评估该方法,我们引入TC-Bench,这是首个利用TuringComplete生态系统集体智慧的门级基准测试平台(该竞争性电路设计平台拥有数十万参与者)。实验表明CircuitMind使55.6%的模型实现在综合效率指标上达到或超越顶级人类专家。最显著的是,我们的框架使14B参数的Phi-4模型表现优于GPT-4o mini和Gemini 2.0 Flash,在无需专门训练的情况下达到前25%人类专家的效率水平。这些创新建立了硬件优化的新范式,即协作式AI系统通过整合人类集体智慧来实现最优电路设计。我们的模型、数据及代码已在https://github.com/BUAA-CLab/CircuitMind开源。
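The "syntax locking" idea in the abstract above (constraining generation to basic logic gates) can be illustrated with a small checker. This is only a sketch under stated assumptions, not CircuitMind's implementation: the function name `check_gate_level`, the choice of Verilog as the target syntax, and the use of Verilog's built-in gate primitives as the allowed set are all illustrative choices that the abstract does not specify.

```python
import re

# Verilog's built-in gate primitives. Treating exactly this set as the
# "basic logic gates" is an assumption, not CircuitMind's documented list.
ALLOWED_GATES = {"and", "or", "not", "nand", "nor", "xor", "xnor", "buf"}
# Structural keywords that are allowed but do not instantiate a gate.
STRUCTURAL = {"module", "endmodule", "input", "output", "wire"}


def check_gate_level(verilog):
    """Return (is_gate_level, gate_count) for a Verilog snippet.

    A statement passes if its first identifier is a gate primitive or a
    structural keyword; anything else (assign, always, ...) fails the check.
    Each gate statement counts once, even if it declares several instances.
    """
    # Remove line and block comments before splitting into statements.
    text = re.sub(r"//[^\n]*|/\*.*?\*/", "", verilog, flags=re.S)
    gate_count = 0
    for stmt in text.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        match = re.match(r"[A-Za-z_]\w*", stmt)
        if match is None:
            return False, gate_count
        keyword = match.group(0)
        if keyword in ALLOWED_GATES:
            gate_count += 1
        elif keyword not in STRUCTURAL:
            return False, gate_count
    return True, gate_count


half_adder = """
module half_adder(input a, input b, output s, output c);
  xor g1(s, a, b);  // sum bit
  and g2(c, a, b);  // carry bit
endmodule
"""
behavioral = "module inv(input a, output y); assign y = ~a; endmodule"

print(check_gate_level(half_adder))
print(check_gate_level(behavioral))
```

The gate count returned here is the kind of quantity a gate-level efficiency comparison against human designs would rely on; how CircuitMind actually scores efficiency is not described in the abstract.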
A Framework for Testing and Adapting REST APIs as LLM Tools
Abstract
arXiv:2504.15546v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are enabling autonomous agents to perform complex workflows using external tools or functions, often provided via REST APIs in enterprise systems. However, directly utilizing these APIs as tools poses challenges due to their complex input schemas, elaborate responses, and often ambiguous documentation. Current benchmarks for tool testing do not adequately address these complexities, leading to a critical gap in evaluating API readiness for agent-driven automation. In this work, we present a novel testing framework aimed at evaluating and enhancing the readiness of REST APIs to function as tools for LLM-based agents. Our framework transforms APIs into tools, generates comprehensive test cases for the APIs, translates test cases into natural language instructions suitable for agents, enriches tool definitions, and evaluates the agent's ability to correctly invoke the API and process its inputs and responses. To provide actionable insights, we analyze the outcomes of 750 test cases, presenting a detailed taxonomy of errors, including input misinterpretation, output handling inconsistencies, and schema mismatches. Additionally, we classify these test cases to streamline debugging and refinement of tool integrations. This work offers a foundational step toward enabling enterprise APIs as tools, improving their usability in agent-based applications.
摘要
大型语言模型(LLMs)正推动自主代理通过外部工具或函数执行复杂工作流,这些工具通常由企业系统中的REST API提供。然而,直接将这些API作为工具使用存在挑战,因其输入模式复杂、响应烦琐且文档说明常存在歧义。当前工具测试基准未能充分应对这些复杂性,导致评估API对代理驱动自动化的就绪度存在关键缺口。本研究提出一种新颖的测试框架,旨在评估和提升REST API作为基于LLM代理工具的就绪性。该框架将API转化为工具,为API生成全面测试用例,将测试用例转换为适合代理的自然语言指令,丰富工具定义并评估代理正确调用API及处理输入输出的能力。为提供可操作的见解,我们分析了750个测试案例的结果,提出详细的错误分类法,包括输入误解、输出处理不一致和模式失配等问题。此外,我们对测试案例进行分类以简化工具集成的调试与优化。本工作为实现企业API工具化迈出基础性一步,提升了其在基于代理应用中的可用性。
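The first and last steps of the pipeline described above (turning an API into a tool, then checking whether an agent invoked it correctly) might look roughly like the following. This is a minimal sketch, not the paper's framework: the helper names `operation_to_tool` and `check_call`, the OpenAPI-style input dictionary, the example `getOrder` operation, and the error message wording are all hypothetical.

```python
# Map JSON-schema type names to Python types for a shallow type check.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def operation_to_tool(path, method, operation):
    """Flatten one OpenAPI-style operation into a function-style tool definition."""
    properties, required = {}, []
    for param in operation.get("parameters", []):
        properties[param["name"]] = {
            "type": param.get("schema", {}).get("type", "string"),
            "description": param.get("description", ""),
        }
        if param.get("required", False):
            required.append(param["name"])
    return {
        "name": operation.get("operationId", f"{method}_{path}"),
        "description": operation.get("summary", ""),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def check_call(tool, arguments):
    """Return a list of problems found in an agent's proposed arguments."""
    schema = tool["parameters"]
    problems = [
        f"missing required argument: {name}"
        for name in schema["required"]
        if name not in arguments
    ]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown argument: {name}")
            continue
        expected = JSON_TYPES.get(spec["type"], object)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and spec["type"] in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


# Hypothetical operation, used only to exercise the sketch.
operation = {
    "operationId": "getOrder",
    "summary": "Fetch a single order",
    "parameters": [
        {"name": "order_id", "in": "path", "required": True, "schema": {"type": "integer"}},
        {"name": "verbose", "in": "query", "schema": {"type": "boolean"}},
    ],
}
tool = operation_to_tool("/orders/{order_id}", "get", operation)
print(check_call(tool, {"order_id": "7"}))
```

Problems reported this way loosely correspond to the "schema mismatch" and "input misinterpretation" categories named in the abstract; the paper's actual taxonomy and checks may differ.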
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Abstract
arXiv:2504.19467v2 Announce Type: replace-cross Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
摘要
大型语言模型(LLMs)在医疗领域具有广阔的应用前景且发展迅猛,新模型的发布速度持续加快。然而,当前针对临床场景的LLM评估仍存在局限——多数基准测试依赖于医学考试式题目或PubMed衍生的文本,未能体现真实世界电子健康记录(EHR)数据的复杂性;另一些则过度聚焦特定应用场景,限制了其在更广泛临床用途中的普适性。为弥补这一空白,我们提出BRIDGE:一个涵盖九种语言、包含87项源自真实临床数据源任务的综合性多语言基准。我们系统评估了52个前沿LLM(含DeepSeek-R1、GPT-4o、Gemini和Llama 4)在不同推理策略下的表现。通过13,572次实验,结果显示模型性能在参数量级、语言种类、自然语言处理任务及临床专科领域存在显著差异。值得注意的是,我们发现开源LLM能达到与专有模型相当的性能,而基于旧架构的医学微调LLM往往逊色于更新的通用模型。BRIDGE及其对应排行榜为真实世界临床文本理解领域的新LLM开发与评估提供了基础性资源和独特参照标准。BRIDGE排行榜详见:https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
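As a quick consistency check on the figures quoted above, the reported experiment count factors exactly over the task and model counts:

```latex
\[
87 \times 52 = 4524, \qquad \frac{13572}{4524} = 3 .
\]
```

The total is therefore consistent with every task-model pair being run under three inference strategies. The abstract only says "various inference strategies", so the figure of three is an inference from the arithmetic, not a stated fact.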